Discrete Bayesian Networks (BN's) have been very successful as
a framework both for inference and for expressing certain causal
hypotheses. In this paper we present a class of graphical models, called
chain event graph (CEG) models, that generalises the class of
discrete BN models. It provides a flexible and expressive framework
for representing and analysing the implications of causal hypotheses,
expressed in terms of the effects of a manipulation of the underlying
generating system. We prove that, as for a BN, identifiability analyses
of causal effects can be performed by examining the topology
of the CEG, leading to theorems analogous to the back-door
theorem for the BN.

1. Introduction. Bayesian networks have now been extended to Causal Bayesian
Networks (CBN's) using a non-parametric representation based on structural equation models
[11, 13, 18, 30]. These provide a framework for expressing assertions about what might
happen when the system under study is externally manipulated and some of its variables are
assigned certain values. Motivated by comments in [25], we develop an alternative graphical
RICCOMAGNO AND SMITH

representation of a causal model, called the chain event graph model. This is constructed from
an event tree together with a set of exchangeability assumptions. It can be seen as a
generalisation of a probability graph [25] and typically has many fewer nodes than the original
event tree. It was introduced in [28] in parallel with the present paper. In [28], analogues of
d-separation theorems, which give sufficient conditions for determining whether a conditional
independence statement holds, are given for CEG's. Here we discuss a causal extension of CEG
models that is as transparent and compelling as the extension from BN to CBN.

We refer to the introduction in [28] for a comparison of CEG models with other
frameworks for propagation and for interrogating an elicited model, such as probability decision
graphs [13], context-specific networks [3], cofactors [21] and case-factor diagrams



id- 



Here we briefly outline some of the reasons why chain event graph models are important.

In some applications, e.g. in Bayesian decision analysis jj]], risk analysis Q], physics [if]] and
biological regulation [5], the first stage of the elicitation of a model is often based on the
elicitation of an event tree. Here is an example of the type of context we have in mind.

Example 1.1. The police hold a suspect S who they believe threw a brick through
the kitchen window and stole a large quantity of money. The police hope to bring S to
court (indicator $X_1$) but might, because of certain legal technicalities, be forced to release him. It
is uncertain whether the suspect was at the scene when the money was stolen (indicator $X_2$),
whether he was the individual who threw the brick and stole the money (indicator $X_3$), whether
the forensic service will find glass matching the window glass on the clothing of S (indicator
$X_4$), whether a witness will identify S (indicator $X_5$) and whether S will be convicted C or
released R (the "effect" indicator of interest $X_6$). Unless the suspect is identified by the
witness as the one who threw the brick, S will not be convicted. The glass match is believed
to depend only on whether or not S was present at the crime scene, whilst the quality of the
witness identification is believed to depend on whether or not S was at the scene of the
CHAIN EVENT GRAPHS

crime as well as on whether or not he threw the brick. The police will later learn whether 
they have to release S before trial, whether the witness will identify S and the results of 
the forensic test. 

Such a problem has an event tree representation. However this is rather cumbersome.
Recently it has been more usual to represent this problem using a BN. Thus, given that S is
not released (i.e. conditional on $X_1 = 1$), the following BN is consistent with the possible
unfoldings of events as described above

        X2 --> X3 --> X5
(1)       \     |      |
           X4 ------> X6

Example 1.2 (Continuation of Example 1.1). Although the graphical representation in
(1) is illuminating, it is not ideal. First, it is partial because the sample space that includes
$X_1$ is not naturally a product space. Thus if S is released, forensic evidence will not be
collected and the witness will not be allowed to testify, so in this sense these variables do
not exist under this contingency. Of course we can formally define $X_4$ and $X_5$ conditional
on $X_1 = 0$ so that a BN is consistent with this story. One such candidate is given by

        X2 --> X3 --> X5
          \     |      |
           X4 ------> X6 <-- X1

This stores the probabilities well. However, as a qualitative representation of the possible
unfoldings of events, perhaps used as the basis for embodying causal conjectures, it is not
ideal. See also Example 3.5 below.

But suppose we use this BN representation of the problem. Second, we note that it only
conveys certain aspects of the story. For example, the facts that S could only throw the brick
$\{X_3 = 1\}$ if he were present at the scene of the crime $\{X_2 = 1\}$ and that conviction $\{X_6 = 1\}$
requires $\{X_5 = 1\}$ are not expressed in the diagram. Moreover we might like to incorporate
context-specific information into the representation because it is informative about various
causal hypotheses (see e.g. Q]). This is particularly so for models of biological regulatory
mechanisms, which typically contain many noisy "and" and "or" gates [lj. Third, we might
well be interested in the causal effect of, for example, forcing the witness to identify S as
the culprit $\{X_5 = 1\}$ if a match $\{X_4 = 1\}$ in the glass is found. This is not represented in
the usual semantics of the BN above. Furthermore, the manipulations we want to represent
will not necessarily correspond to setting the original variables in the
idle system to certain values.

The deficiencies of BN's expressed in Example 1.2 should not be overemphasised.
As stated in [J], there is an art to drawing the appropriate BN of a problem, and it is
sometimes necessary to redefine the variables defining the problem or to add more edges to
the graph to aid representation. For example, were we to add an edge between $X_4$ and
$X_5$, the manipulation described in Example 1.2 could be expressed as a contingent
decision (although we would then have lost some information in the representation). Furthermore
we can often transform the variables in a BN not only to encode more information but
also so that the manipulation can be seen as setting a random variable to a value.
Nevertheless the BN, whilst being consistent with a story like the one above, will still only
be a partial representation in general. The CEG in Figure 1 gives a more expressive graphical
representation, able to tell more of the story, which like the BN supports entirely graphical
inferences about irrelevances and the potential effects of certain causal manipulations, albeit
at the cost of some simplicity.

The partially ordered sequence in which events unfold, as expressed by the event tree,
is retained in the CEG construction. Event trees explicitly acknowledge asymmetries
embedded in a structure, both in its development and in its sample space. These are retained
in the CEG. Recall that the dimension of the sample space is often critical for
determining identifiability, especially when there are hidden variables, even in symmetric models
[26, 23, 20].



In probability trees conditional independence relations can be embedded through
equations linking the probabilities that label the edges of the tree. These are used to construct the
vertices of the CEG and its undirected edges. Many conditional independence statements,
including all those associated with BN's, many context-specific BN's, and also those over
functions of the variables (such as noisy and/or gates), can be stated as equalities of distributions
associated with vertices of the underlying tree. This means that all such statements are
expressed explicitly via the topology of the CEG. See Example 2.7 for complete independence
models.

A basic assumption of our definition of causation (see Definition 3.1) is the belief that
intervention on, or manipulation of, one or more vertices of the tree/CEG can model external
intervention on the underlying process being modelled. This is analogous to the do-operator
in [18] and is consistent with the representation of the effect of a cause on a tree model
in 2a], albeit there not necessarily as a result of a manipulation. It allows us to include
naturally information relative to the background idle system in the analysis of causal effects.

An advantage of the use of a framework based on event trees, and hence of CEG's, is that 
causal hypotheses can be explicitly separated from any direct link with the measurement 
process as will be illustrated later. 



For a good discussion of many of the above points see [25], in particular on the
advantages of event trees (and hence CEG's) for coding asymmetrical problems, and as a powerful
expression of an observer's beliefs, especially when those beliefs are based on an underlying
conjecture about a causal mechanism.

Section 2 contains the basic terminology and definitions. The extension from event trees
and CEG's to causal probability trees and causal CEG's is in Section 3. A theorem of
identifiability analogous to the back-door theorem is proved in Section 4. Motivating and illustrative
examples are presented throughout.

2. Chain event graphs. From a Bayesian perspective probability trees describe the 
observer's beliefs about what will happen as events unfold. The edges out of a node v of the 
probability tree represent the possible unfolding that can occur from the situation labelled 
by v, or equivalently the event space of a random variable that can be indexed by v. The 
sample space of the experiment at a node is given by the branches of the probability tree 
at that point. 

Through two equivalence relations on the nodes of the tree, we construct a new model 
structure, called a chain event graph, which includes Bayesian Networks and which provides 
a natural framework for defining causality. 

We start with some definitions to set up notation and formalise ideas. Some of these
definitions are slightly non-standard for reasons that will become apparent later in the
paper. Section 2.1 draws strongly from [24, 25, 2a] and we refer to those works for further
details.



2.1. Probability trees. The model structure upon which a manipulation operation is
defined in Section 3 is a graph constructed from a rooted, directed (event) tree $T =
(V(T), E(T))$, where $V(T)$ and $E(T)$ are the sets of vertices (or nodes) and of edges,
respectively. Write $V = V(T)$ and $E = E(T)$ when there is no ambiguity, and assume $V$
and $E$ finite. The single root node is denoted by $v_0$. Between two vertices there is at most
one edge. Each edge $e \in E$ is directed, with a parent node $v$ and a child node $v'$. We write
$e = (v, v')$ and note that this edge can be identified with its child vertex $v'$.

Definition 2.1. For $v \in V$ let $\mathbb{X}(v) = \{v' \in V : \text{there exists } e \in E \text{ such that }
e = (v, v')\}$ and call $\mathbb{X}(v)$ the set of children of $v$.



Thus $\mathbb{X}(v)$ is in one-to-one correspondence with the set of edges out of $v$. If $\mathbb{X}(v)$ is
the empty set, then $v$ is called a leaf node, otherwise it is called a situation. The set of
situations, $S = S(T) \subset V(T)$, will have particular significance. Note that $\{\mathbb{X}(v) : v \in S\} \cup
\{\{v_0\}\}$ partitions $V$.

A path between two vertices $v$ and $v'$ is an ordered sequence of nodes $\lambda = \lambda(v, v') =
(v_1, \ldots, v_{n[\lambda]+1})$ where $v_1 = v$, $v_{n[\lambda]+1} = v'$ and $v_k$ is the child of $v_{k-1}$ and the parent of
$v_{k+1}$ for $k = 2, \ldots, n[\lambda]$. This path can equivalently be identified with the ordered sequence
of its edges $\lambda = (e_1, \ldots, e_{n[\lambda]})$ where $e_k = (v_k, v_{k+1})$ for $k = 1, \ldots, n[\lambda]$. The number $n[\lambda]$
of edges is called the length of the path. We write $v \in \lambda$ whenever $\lambda$ contains the vertex $v$.

Definition 2.2. Let $\mathbb{X}(T) = \mathbb{X} = \{\lambda(v_0, v) : v \in V \setminus S\}$ be the set of root-to-leaf paths.
The elements of $\mathbb{X}$ are called atomic events of $T$.

Clearly, $\mathbb{X}(T)$ is in one-to-one correspondence with the leaves of $T$. Paths determine a
partial order on $V(T)$. A probability tree could be drawn so that the situations along each
root-to-leaf path correspond to one of the possible historical developments of the modelled
problem. In some cases this directionality is inherent to the problem to be modelled and
expresses a conjecture about the ordering in which one situation follows another (see in
particular [23, Section 2.8]).
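The definitions of this subsection can be illustrated with a minimal Python sketch. It stores a small, hypothetical event tree as a child map and recovers the situations, the leaves and the root-to-leaf paths (the atomic events). The tree shape and the names `CHILDREN`, `situations`, `leaves` and `root_to_leaf_paths` are illustrative assumptions, not notation from the paper.

```python
# Hypothetical event tree: each vertex maps to the list of its children.
CHILDREN = {
    "v0": ["v1", "v2"],
    "v1": ["v3", "v4", "v5"],
    "v3": ["v6", "v7"],
    "v2": [], "v4": [], "v5": [], "v6": [], "v7": [],
}
ROOT = "v0"


def situations(children):
    """Vertices with at least one child."""
    return {v for v, kids in children.items() if kids}


def leaves(children):
    """Vertices with no children."""
    return {v for v, kids in children.items() if not kids}


def root_to_leaf_paths(children, root):
    """All root-to-leaf paths, each given as a tuple of vertices."""
    if not children[root]:
        return [(root,)]
    return [(root,) + tail
            for child in children[root]
            for tail in root_to_leaf_paths(children, child)]


PATHS = root_to_leaf_paths(CHILDREN, ROOT)
# The children sets of the situations, together with {v0}, should
# partition the vertex set.
child_sets = [set(CHILDREN[v]) for v in situations(CHILDREN)]
```

In this sketch there are as many entries in `PATHS` as there are leaves, which mirrors the one-to-one correspondence between atomic events and leaves noted above.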



2.2. Primitive probabilities. Next we impose a probability structure on $T$. Consider $\mathbb{X}$ as
sample space, the power set $2^{\mathbb{X}}$, and a probability $P$ on $(\mathbb{X}, 2^{\mathbb{X}})$. Thus $P(\lambda)$ is the probability
of the path $\lambda \in \mathbb{X}$. A probability tree is a directed tree $T = (V, E)$ such that to each situation
$v \in S(T)$ is associated a discrete random variable $X(v)$ whose sample space is $\mathbb{X}(v)$. The
distribution of $X(v)$, $v \in S(T)$, determines the primitive probabilities

\[
\pi(e) = \pi(v' \mid v) = \pi(v') = P(X(v) = v') \quad \text{for } v' \in \mathbb{X}(v), \text{ for } e = (v, v'),
\]

which are fundamental in the paper. Let $\Pi_a(T)$, or simply $\Pi_a$, denote the set of all primitive
probabilities; $\pi(e)$ is the label or colour of $e$.

The random variable $X(v)$ can be interpreted as a function on $\mathbb{X}$ which assigns zero
probability to the set of paths not through $v$. Furthermore, for $\lambda \in \mathbb{X}$ the formula $X(v)(\lambda) =
v'$ states that $X(v)$ maps $\lambda$ into the path through $v$ and $v'$, hence the notation $\pi(v' \mid v)$. This
gives an interpretation of $X(v)$ as a function from $\mathbb{X}$ to $\mathbb{X}(v)$. Example 2.5 provides another
motivation for this notation. For every situation $v \in S(T)$ the sum-to-one condition can be
written as

\[
\sum_{v' \in \mathbb{X}(v)} P(X(v) = v') = \sum_{v' \in \mathbb{X}(v)} \pi(v' \mid v) = 1. \tag{2}
\]

Random variables on vertices along a path are required to be mutually independent. It
then follows that the probability $P(\lambda)$ of the atomic event $\lambda = (e_1, \ldots, e_{n[\lambda]}) \in \mathbb{X}$ is the
following product of primitive probabilities

\[
\prod_{k=1}^{n[\lambda]} \pi(e_k) = P(\lambda). \tag{3}
\]

Furthermore, if $\pi(v' \mid v) = 0$ for some $v, v'$ then any atomic event including $v'$ has zero
probability of occurring. In some circumstances the branch starting at $v'$ could be deleted
from the tree.
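As a rough illustration of conditions (2) and (3), the sketch below assigns hypothetical primitive probabilities to the edges of a small tree, checks the sum-to-one condition at each situation and multiplies primitives along each root-to-leaf path. The numerical values and the names `PI`, `check_sum_to_one` and `path_probability` are assumptions made for the example only.

```python
from fractions import Fraction as F
from math import prod

# Hypothetical primitive probabilities pi(v'|v), keyed by edge (v, v').
PI = {
    ("v0", "v1"): F(3, 4), ("v0", "v2"): F(1, 4),
    ("v1", "v3"): F(1, 2), ("v1", "v4"): F(1, 3), ("v1", "v5"): F(1, 6),
    ("v3", "v6"): F(2, 5), ("v3", "v7"): F(3, 5),
}


def children(pi, v):
    """Children of v, read off the edge keys."""
    return [child for (parent, child) in pi if parent == v]


def check_sum_to_one(pi):
    """Condition (2): the primitives out of each situation sum to one."""
    parents = {parent for (parent, _) in pi}
    return all(sum(pi[(v, c)] for c in children(pi, v)) == 1 for v in parents)


def path_probability(pi, path):
    """Formula (3): product of the primitives along the edges of the path."""
    return prod(pi[edge] for edge in zip(path, path[1:]))


def atomic_events(pi, root):
    """All root-to-leaf paths as tuples of vertices."""
    kids = children(pi, root)
    if not kids:
        return [(root,)]
    return [(root,) + tail for c in kids for tail in atomic_events(pi, c)]


P = {path: path_probability(PI, path) for path in atomic_events(PI, "v0")}
```

Exact rational arithmetic is used so that the atomic-event probabilities in `P` can be compared with equality rather than with a floating-point tolerance.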

Above we defined the probabilities of the atomic events. Subsequently we defined some
random variables over the probability space $(\mathbb{X}, 2^{\mathbb{X}}, P)$ and finally used these random
variables to define the primitive probabilities $\Pi_a$. In practice, when modelling a problem over a
probability tree, we would usually work the other way round. While constructing the tree
from root to leaves the modeller has a (possibly not definite) idea of the distribution of
the random variables sitting on situations. Then Equation (3) is used to determine the
probabilities of the atomic events. The $P$'s and $\pi$'s provide equivalent parametrisations of
the probability tree model under the assumption of strict positivity. We do not discuss
this further, but give a simple example to show an algebraic reason why we prefer to
parametrise the probability space $(\mathbb{X}, 2^{\mathbb{X}}, P)$ using the primitive probabilities $\Pi_a$.

Example 2.1. For the tree in Figure 6, Equations (3) are on the left-hand side (LHS)
of the tableau below

\[
\begin{array}{l|l}
p_2 = 1 - \pi_1 & \pi_1 = 1 - p_2 \\[4pt]
p_4 = \pi_1 \pi_4 & \pi_4 = \dfrac{p_4}{1 - p_2} \\[8pt]
p_5 = \pi_1 (1 - \pi_3 - \pi_4) & \pi_3 = \dfrac{1 - p_2 - p_4 - p_5}{1 - p_2} \\[8pt]
p_6 = \pi_1 \pi_3 \pi_6 & \pi_6 = \dfrac{p_6}{1 - p_2 - p_4 - p_5} \\[8pt]
p_7 = \pi_1 \pi_3 (1 - \pi_6) &
\end{array}
\]

where $p_i$ is the probability of the path starting at the root vertex and ending in the vertex $v_i$,
$i \in \{1, \ldots, 7\}$, and $\pi_i = \pi(v_i)$. The equations on the LHS of the tableau give the probability
of an atomic event $\lambda$ as a polynomial in the primitive probabilities of degree equal to the
length of the path $\lambda$.

The RHS of the tableau gives the primitives in terms of the atomic event probabilities.
These are ratios of polynomials of degree one in the path probabilities and are defined if
the denominators are not zero.

Note that $p_6 + p_7 = \pi_1 \pi_3$ is the probability of the path $(v_0, v_1, v_3)$, and that

\[
\pi_4 = \frac{p_4}{1 - p_2} = \frac{P(X(v_0) = v_1, X(v_1) = v_4)}{P(X(v_0) = v_1)} = P(X(v_1) = v_4 \mid X(v_0) = v_1).
\]

Hence the notation $\pi_4 = \pi(v_4 \mid v_1)$ is used to underline its interpretation as the probability of
reaching $v_4$ having arrived in $v_1$. See also Shafer [25]. We will see later that this directional
parametrization is a natural one to use when modelling causal hypotheses, just as the directed
parametrization of the BN naturally projects into the CBN.
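The two parametrisations in Example 2.1 can be checked numerically. The following sketch encodes the left-hand and right-hand sides of the tableau as two functions and confirms, for one set of illustrative values (not taken from the paper), that they invert each other. The function names are hypothetical.

```python
from fractions import Fraction as F


def paths_from_primitives(pi1, pi3, pi4, pi6):
    """LHS of the tableau: atomic-event probabilities from primitives."""
    return {
        2: 1 - pi1,
        4: pi1 * pi4,
        5: pi1 * (1 - pi3 - pi4),
        6: pi1 * pi3 * pi6,
        7: pi1 * pi3 * (1 - pi6),
    }


def primitives_from_paths(p):
    """RHS of the tableau; only defined for non-zero denominators."""
    reach_v1 = 1 - p[2]                 # probability of reaching v1
    reach_v3 = 1 - p[2] - p[4] - p[5]   # probability of reaching v3
    if reach_v1 == 0 or reach_v3 == 0:
        raise ZeroDivisionError("primitive undefined: zero denominator")
    return {1: reach_v1, 3: reach_v3 / reach_v1,
            4: p[4] / reach_v1, 6: p[6] / reach_v3}


# Illustrative values only.
pi = {1: F(3, 4), 3: F(1, 2), 4: F(1, 3), 6: F(2, 5)}
p = paths_from_primitives(pi[1], pi[3], pi[4], pi[6])
recovered = primitives_from_paths(p)
```

The explicit zero-denominator check reflects the remark that the primitives are only defined from the path probabilities when the relevant situations are reached with positive probability.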



The primitive probabilities, which might be unknown or partially known, satisfy some 

imsart-aos ver. 2005/10/19 file: currentRevAnnals.tex date: February 2, 2008 




logical constraints which are often algebraic. Obvious ones are $0 \le \pi(e) \le 1$ and the linear
polynomials in Equations (2). Supplementary constraints might be inherent to the logic of
the problem being modelled and others might be imposed by the modeller, e.g. in Figure 6 the
modeller might know that $X(v_1)$ follows a Binomial distribution with $P(X(v_1) = v_3) = s^2$
and $P(X(v_1) = v_5) = (1 - s)^2$ for some $s \in [0, 1]$. The definition of one class of such
constraints leads to the notion of chain event graphs below.

Example 2.2 (Continuation of Example 2.1). Assume the primitive probabilities and
the path probabilities are all unknown indeterminates and consider the polynomial ideal
generated by $G = \{p_2 - (1 - \pi_1),\ p_5 - \pi_1(1 - \pi_3 - \pi_4),\ p_4 - \pi_1\pi_4,\ p_7 - \pi_1\pi_3(1 - \pi_6),\ p_6 - \pi_1\pi_3\pi_6\}$.
This is the infinite set of polynomials of the form $\sum_{g \in G} s_g\, g$ where $s_g$ is any polynomial in
$p_2, p_4, p_5, p_6, p_7, \pi_1, \pi_3, \pi_4, \pi_6$. The elimination ideal of the $\pi$ variables is generated by the
polynomial condition $\{p_2 + p_5 + p_4 + p_7 + p_6 - 1\}$, obviously. The constraint $\pi_4 = \pi_3\pi_6$
is considered by adjoining the polynomial $\pi_4 - \pi_3\pi_6$ to $G$. The elimination ideal then contains,
obviously, the polynomial $p_4 - p_6$. The propagation of polynomial constraints on the $\pi$'s
and $p$'s is imposed by adjoining the corresponding polynomials to $G$ and computing the
relevant elimination ideal. This makes the analysis of CEG's amenable to techniques from algebraic
statistics [20]. Note that the manipulations defined in Section 3 are polynomial constraints
setting some $\pi$'s equal to zero.
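The two membership claims in Example 2.2 can be verified by hand. As a short worked check, label the generators of $G$ as $g_1, \ldots, g_5$ in the order below (this labelling is introduced here for convenience and is not used in the paper):

```latex
% Labels (assumed for this check only):
%   g_1 = p_2 - (1 - \pi_1),            g_2 = p_5 - \pi_1 (1 - \pi_3 - \pi_4),
%   g_3 = p_4 - \pi_1 \pi_4,            g_4 = p_7 - \pi_1 \pi_3 (1 - \pi_6),
%   g_5 = p_6 - \pi_1 \pi_3 \pi_6.
\begin{align*}
g_1 + g_2 + g_3 + g_4 + g_5
  &= (p_2 + p_4 + p_5 + p_6 + p_7)
     - \bigl[(1 - \pi_1) + \pi_1(1 - \pi_3 - \pi_4) + \pi_1\pi_4
     + \pi_1\pi_3(1 - \pi_6) + \pi_1\pi_3\pi_6\bigr] \\
  &= p_2 + p_4 + p_5 + p_6 + p_7 - 1, \\[6pt]
g_3 - g_5 + \pi_1\,(\pi_4 - \pi_3\pi_6)
  &= p_4 - \pi_1\pi_4 - p_6 + \pi_1\pi_3\pi_6 + \pi_1\pi_4 - \pi_1\pi_3\pi_6 \\
  &= p_4 - p_6 .
\end{align*}
```

Both right-hand sides involve only the $p$ variables, so they lie in the corresponding elimination ideals. The second identity uses the adjoined constraint polynomial $\pi_4 - \pi_3\pi_6$.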

2.3. Stages. 

Definition 2.3. Let $T$ be a probability tree and $\Pi_a(T)$ an associated set of primitive
probabilities. Two situations $v_1, v_2 \in S(T)$ are said to be stage-equivalent if and only if
$X(v_1)$ and $X(v_2)$ have the same distribution.

Definition 2.3 requires that there exists a one-to-one map $\mu : \mathbb{X}(v_1) \to \mathbb{X}(v_2)$ and that
$\pi(v \mid v_1) = \pi(\mu(v) \mid v_2)$ for all $v \in \mathbb{X}(v_1)$. It determines an equivalence relation on $S(T)$ whose
equivalence classes are called stages. For each stage $u$ define

\[
\Pi(u) = \{\pi(v' \mid v) : v' \in \mathbb{X}(v) \text{ for some } v \text{ representative of } u\}
\]

and $\Pi = \Pi(T) = \bigcup_{u \in L(T)} \Pi(u)$, where $L(T)$ denotes the set of stages. The set of primitive
probabilities $\Pi$ is no larger than the set
of all primitive probabilities $\Pi_a$ and clearly still sufficient for determining the distribution
of all random variables measurable with respect to $(\mathbb{X}, 2^{\mathbb{X}})$. The pair $(T, \Pi(T))$ is called a
probability tree model.
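A minimal sketch of how stages might be computed is given below. It assumes that the primitive probabilities are known numbers, so that "same distribution up to a one-to-one relabelling of the children" reduces to comparing sorted lists of edge probabilities. In the paper the equalities may instead be asserted between unknown or symbolic quantities, which this sketch does not cover. The tree and function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical tree: situation -> {child: primitive probability}.
TREE = {
    "v0": {"v1": F(1, 2), "v2": F(1, 2)},
    "v1": {"v3": F(1, 5), "v4": F(4, 5)},
    "v2": {"v5": F(4, 5), "v6": F(1, 5)},
}


def stages(tree):
    """Partition the situations into stages.

    Two situations share a stage when some one-to-one map between their
    children matches the primitive probabilities, i.e. when the sorted
    lists of their edge probabilities coincide.
    """
    groups = defaultdict(set)
    for v, dist in tree.items():
        groups[tuple(sorted(dist.values()))].add(v)
    return sorted(groups.values(), key=sorted)


def stage_parameters(tree):
    """One representative distribution per stage (the reduced set)."""
    return [tree[min(stage)] for stage in stages(tree)]


STAGES = stages(TREE)
```

Here `v1` and `v2` fall in one stage even though the matching children are listed in a different order, so the reduced parameter set has four entries rather than six.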

2.4. Examples. 

Example 2.3. Suppose that only three incidents $A$, $B$, $C$ can occur and that the order of
occurrence of $B$ and $C$ is relevant and contingent on whether $A$ or not $A$ ($\bar{A}$) happens.
Their history unfolds according to the probability tree in Figure 7. For example the path
$(v_0, v_1, v_3)$ represents $B$ occurring after $A$ has occurred. The random variable $X(v_0)$
is the indicator function of the event of the incident $A$ happening; $\pi(v_1)$ is the probability
of this event and the primitive $\pi(v_3)$ is the probability that $B$ occurs if $A$ has occurred.

Henceforth we shall assume that the tree fully represents all possible unfoldings of
situations. The unfolding "if first A and then B, then C" is simply not part of our
story. The probability of the incident "A does not happen and then B does" is given
by $\pi(v_2)\pi(v_6) + \pi(v_2)\pi(v_7)\pi(v_9)$, being the sum of the probabilities of the atomic events
$(v_0, v_2, v_6)$ and $(v_0, v_2, v_7, v_9)$. Note that the path $\sigma$-algebra is not the $\sigma$-algebra generated
by $\{A, B, C\}$ where these are thought of as events, since this $\sigma$-algebra cannot express the
ordering of incidents.

Figure 7 could represent the following situation. A woman with epilepsy either does not
want to conceive ($A$) or wants to conceive ($\bar{A}$). She has three alternatives: to take medicine
$B$, to take medicine $C$ or to follow some other health regime, $\overline{B \cup C}$. Medicine $C$ has fewer
side effects than $B$ (thus we shall impose $\pi(v_4) \gg \pi(v_3)$) but it might have an adverse
effect on the formation of the nervous system of the foetus if she becomes pregnant, thus
$\pi(v_7) < \pi(v_4)$. If she discovers pregnancy within the first three months from conception
then the medicine $B$ can be taken as a supplement to $C$ to reduce this adverse effect
greatly. But unless she discovers pregnancy she would never take $B$ after having taken $C$
because of other health risks to herself linked to the combined use of $B$ and $C$.

Example 2.4 (Continuation of Example 2.3). The modeller might want to assert that
$\pi(v_3) = \pi(v_7)$ and $\pi(v_4) = \pi(v_6)$. This assigns $v_1$ and $v_2$ to the same stage. This assertion
implies two things. First, if the woman is not taking either $B$ or $C$ then the probability
of changing her health regime is the same whether or not she wants to become pregnant.
Second, if she wants to become pregnant then she will prefer to take $C$ as much as she is
likely to take $B$ if she does not want to become pregnant.

Example 2.5. Let $X$ and $Y$ be two binary random variables. Let $X = 1$ be the event
that a person watches a violent movie on Saturday night and $Y = 1$ the event of that person
getting into a fight on Saturday night. In Figure 8, $X(v_0) = X$, $X(v_1) = (Y \mid X = 1)$ and
$X(v_2) = (Y \mid X = 0)$. The path $(v_0, v_1, v_3)$ corresponds to having watched a violent movie
and ending up in a fight and the path $(v_0, v_2, v_5)$ to not having watched a violent movie and
ending up in a fight. The three primitive probabilities $\pi(v_1)$, $\pi(v_3)$ and $\pi(v_5)$ are sufficient
to parametrise the model.

Example 2.6 (Continuation of Example 2.5). In Figure 8, $v_1$ and $v_2$ can be in the same
stage in two ways: (1) $\pi(v_3) = \pi(v_5)$ and (2) $\pi(v_3) = \pi(v_6)$. In (1) we assume that $X$
and $Y$ are independent, i.e. watching a violent movie has no effect on the probability of
subsequent violence; while in (2) we assume that $P(Y = 0 \mid X = 0) = P(Y = 1 \mid X = 1)$, i.e.
the probability that violence occurs after seeing violence can be equated with the probability
of non-violence occurring after not seeing violence, a type of conservation-of-violence law. If
the values $X$ and $Y$ take were $-1$ and $1$ then this last equality would imply the independence
of $X$ and $XY$.
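The final claim of Example 2.6 can be checked by direct enumeration. The sketch below builds the joint law of $(X, Y)$ on $\{-1, 1\}^2$ under assumption (2) and tests whether $X$ and $XY$ are independent. The numerical values of the two free parameters and the function names are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product


def joint(p_x1, q):
    """Joint law of (X, Y) on {-1, 1}^2 under assumption (2).

    p_x1 = P(X = 1); q = P(Y = 1 | X = 1) = P(Y = -1 | X = -1).
    """
    p_y1_given_x = {1: q, -1: 1 - q}
    law = {}
    for x, y in product((-1, 1), repeat=2):
        px = p_x1 if x == 1 else 1 - p_x1
        py = p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
        law[(x, y)] = px * py
    return law


def independent_x_and_xy(law):
    """Check P(X = a, XY = z) = P(X = a) P(XY = z) for all a, z."""
    def prob(event):
        return sum(pr for (x, y), pr in law.items() if event(x, y))
    return all(
        prob(lambda x, y: x == a and x * y == z)
        == prob(lambda x, y: x == a) * prob(lambda x, y: x * y == z)
        for a, z in product((-1, 1), repeat=2)
    )


LAW = joint(F(3, 10), F(7, 10))
```

With these values $X$ and $Y$ themselves are dependent, so the independence of $X$ and $XY$ is a genuinely different statement from model (1).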

2.5. Chain event graphs. In this section we assume that the observer is able to express 
two pieces of qualitative information: the topology of the probability tree and its stages. 
These two sources of information can be fully represented using a mixed graph called a chain 
event graph. Mixed means that the graph has directed and undirected edges. Directed edges 
are labelled with primitive probabilities while undirected edges are not labelled. 

Let $(T, \Pi)$ be a probability tree model. For a situation $v \in S$ let $T(v)$ be the sub-tree
starting at $v$. In [14] $T(v)$ is called the subgraph induced by $v$ and its ancestors. Let $\Pi_v$ be
the subset of $\Pi$ labelling edges in $T(v)$. Then $(T(v), \Pi_v)$ is a probability tree model.

Definition 2.4. Two situations $v$ and $v^*$ in the probability tree $(T, \Pi)$ are equivalent
if and only if

1. $T(v)$ and $T(v^*)$ are isomorphic. That is, there exists a map $\mu$ between the sets of vertices
of $T(v)$ and $T(v^*)$ such that $(\mu(v_1), \mu(v_2))$ is an edge in $T(v^*)$ if and only if $(v_1, v_2)$
is an edge in $T(v)$, and

2. for every situation $w$ in $T(v)$, $w$ and $\mu(w)$ are in the same stage.

The induced equivalence classes are called positions and $K(T)$ is the set of positions.

Item 2 in Definition 2.4 simply means that $\pi(v_2 \mid v_1) = \pi(\mu(v_2) \mid \mu(v_1))$ for all possible
$v_1, v_2 \in T(v)$. In Definition 2.4 we require that the sub-trees are topologically isomorphic
and that their edge probabilities match according to the isomorphism, whether they are
known fixed values, unknown fixed values or indeterminates. Clearly the partition of
situations into positions is a refinement of the partition into stages: if two situations are in
the same position then they are in the same stage. Two situations are in the same position
when the stories unfolding from them are believed to be governed
by processes with the same distribution. Note that this is a predictive, not a retrospective,
equivalence. Once a unit reaches the situation $v$ or the situation $v^*$, all pairs of possible
unfoldings from $v$ and $v^*$ occur with the same probabilities. It is in this respect that positions
are natural objects on which to describe a causal manipulation.

In broad terms, stages are used for estimation as they reduce the number of
parameters (primitive probabilities) and positions are used to express conditional independence
statements [28]. Positions are used to form the vertices of a new graph called the chain
event graph. Its undirected edges join positions at the same stage and its directed paths
correspond to root-to-leaf paths of the probability tree model.
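The difference between stages and positions can be sketched as follows. The code assumes numerically known primitive probabilities and identifies a position with a canonical "signature" of the sub-tree rooted at a situation, namely the sorted collection of (edge probability, child signature) pairs. This is one possible way to test the two conditions of Definition 2.4, not the authors' algorithm, and the tree and names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical tree: situation -> {child: primitive probability}.
# Vertices that do not appear as keys are leaves.
TREE = {
    "v0": {"v1": F(2, 5), "v2": F(3, 5)},
    "v1": {"v3": F(1, 4), "v4": F(3, 4)},
    "v2": {"v5": F(1, 4), "v6": F(3, 4)},
    "v3": {"v7": F(1, 3), "v8": F(2, 3)},
    "v5": {"v9": F(1, 2), "v10": F(1, 2)},
    "v6": {"v11": F(1, 3), "v12": F(2, 3)},
}


def signature(tree, v):
    """Canonical form of the sub-tree T(v) with its edge probabilities."""
    if v not in tree:
        return ()  # leaf
    return tuple(sorted((p, signature(tree, c)) for c, p in tree[v].items()))


def positions(tree):
    """Situations grouped by equal sub-tree signature."""
    groups = defaultdict(set)
    for v in tree:
        groups[signature(tree, v)].add(v)
    return sorted(groups.values(), key=sorted)


def stages(tree):
    """Situations grouped by equal sorted edge probabilities."""
    groups = defaultdict(set)
    for v, dist in tree.items():
        groups[tuple(sorted(dist.values()))].add(v)
    return sorted(groups.values(), key=sorted)
```

In this example `v1` and `v2` share a stage but not a position, because the sub-trees below them differ, whereas `v3` and `v6` share both. Every position is contained in a single stage, in line with the refinement property stated above.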

Definition 2.5. Let $(T, \Pi)$ be a probability tree model. Its chain event graph, $C(T)$, is
the mixed graph $(V(C(T)), E_d(C(T)), E_u(C(T)), \Pi(C(T)))$ where

1. $V(C(T)) = K(T) \cup \{w_\infty\}$ is the vertex set. The vertex $w_\infty$ is called the sink vertex.

2. $E_d(C(T))$ is a multi-set of directed edges and is partitioned into two sets, $E_1(C(T))$
and $E_2(C(T))$, constructed as follows. For each $w \in K(T)$ choose $v \in V(T)$ a
representative of $w$. For each edge $(v, v')$ in $E(T)$

(a) if $v'$ is in position $w'$ then add a directed edge from $w$ to $w'$ to the multi-set
$E_1(C(T))$,

(b) if $v'$ is a leaf node then add a directed edge from $w$ to $w_\infty$ to the multi-set
$E_2(C(T))$.

3. $E_u(C(T))$ is a set of undirected edges joining positions in the same stage, namely

\[
E_u(C(T)) = \bigl\{\{w, w'\} : w \neq w' \text{ and there exist } v, v' \in S(T),\
u \in L(T) \text{ with } v, v' \in u \text{ and } v \in w,\ v' \in w'\bigr\}.
\]

4. $\Pi(C(T)) = \Pi$ and if $e_1, e_2 \in E_d(C(T))$ and $\pi(e_1) = \pi(e_2)$ in the original tree, then $e_1$
and $e_2$ have the same label in the CEG.



When there is no ambiguity we write $(V_C, E_d, E_u, \Pi)$ and $C$ for the chain event graph.
The CEG fully expresses the structure $\mathbb{X}$ of the sample space of a tree because there are as
many directed root-to-leaf paths in the original tree as there are root-to-sink paths in its
CEG. However, it often has many fewer edges.

If two vertices $v$ and $v^*$ of the original tree are in the same position, then for each
path $\lambda(v, v_M)$ in the sub-tree $T(v)$ there exists a corresponding path $\lambda^*(v^*, v^*_{M^*})$ in $T(v^*)$
along which the same evolution occurs. This implies that $P(\lambda) = P(\lambda^*)$. In particular
consider the root-to-leaf paths, given in terms of vertices, $\lambda(v_0, \ldots, v, \ldots, v_M)$ and
$\lambda^*(v_0, \ldots, v^*, \ldots, v^*_{M^*})$ where $v_M$ and $v^*_{M^*}$ are leaves in $T$ and $v$, $v^*$ are in the same position.
Then

\[
P(\lambda) = P(\lambda(v_0, v))\, P(\lambda(v, v_M)) \quad \text{and} \quad P(\lambda^*) = P(\lambda^*(v_0, v^*))\, P(\lambda(v, v_M)). \tag{4}
\]

The same formula holds when the paths are considered in the CEG. In this case $v_0$ is
substituted by the root node and $v_M$ and $v^*_{M^*}$ become the sink node.
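The directed part of the construction in Definition 2.5 can be sketched in a few lines. The code below names each position by the canonical signature of its sub-tree (an implementation convenience assumed here, with numerically known probabilities), collects the directed edges as a multi-set, and counts root-to-sink paths. The tree, `SINK` and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction as F

# Hypothetical tree: situation -> {child: primitive probability}.
TREE = {
    "v0": {"v1": F(1, 2), "v2": F(1, 2)},
    "v1": {"v3": F(1, 4), "v4": F(3, 4)},
    "v2": {"v5": F(1, 4), "v6": F(3, 4)},
}
SINK = "w_inf"


def signature(tree, v):
    """Canonical form of the sub-tree rooted at v; () for a leaf."""
    if v not in tree:
        return ()
    return tuple(sorted((p, signature(tree, c)) for c, p in tree[v].items()))


def ceg_directed_edges(tree):
    """Multi-set of labelled directed CEG edges (position, position, label).

    Positions are named by their signature; every leaf goes to the sink.
    """
    edges = Counter()
    seen = set()
    for v in tree:
        w = signature(tree, v)
        if w in seen:          # use one representative per position
            continue
        seen.add(w)
        for child, p in tree[v].items():
            target = signature(tree, child) if child in tree else SINK
            edges[(w, target, p)] += 1
    return edges


def count_paths(edges, start):
    """Number of directed paths from start to the sink."""
    if start == SINK:
        return 1
    return sum(n * count_paths(edges, t)
               for (s, t, _), n in edges.items() if s == start)


def count_leaves(tree, v):
    """Number of leaves below v in the original tree."""
    return 1 if v not in tree else sum(count_leaves(tree, c) for c in tree[v])


EDGES = ceg_directed_edges(TREE)
ROOT = signature(TREE, "v0")
```

In this example `v1` and `v2` are in the same position, so the CEG has a double edge out of the root and four directed edges in total, against six edges in the tree, while the number of root-to-sink paths still equals the number of root-to-leaf paths.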

Example 2.7 (Continuation of Example 2.5). Figure 9 gives the CEG when $v_1$ and $v_2$
are in the same stage. The values of the edge labels indicate whether it is model (1) or (2).

Example 2.8. Figures 10 and 11 give a tree and its CEG for the stage partition
$\{\{v_0\}, \{v_1, v_3, v_{13}, v_{17}\}, \{v_2, v_7\}, \{v_5, v_9\}, \{v_{19}\}\}$ and the position partition
$\{\{v_0\}, \{v_1, v_3\}, \{v_5, v_9\}, \{v_2\}, \{v_7\}, \{v_{13}\}, \{v_{17}\}, \{v_{19}\}, w_\infty\}$.

Example 2.9 (Continuation of Example 2.4). The positions are $\{v_0\}$, $\{v_1\}$, $\{v_2\}$, $\{v_7\}$, $w_\infty$
and the CEG is in Figure 12, where $\pi_i = \pi(v_i)$.



Example 2.10 (Bayesian network). In [28] it is proved that any discrete Bayesian
network on the random variables $\{X_1, \ldots, X_n\}$ can be expressed by a CEG. Example 2.11
below shows how to retrieve the conditional independence statements of the BN from the
topology of any of its CEG's. Like context-specific BN's [3] but unlike the probability decision
graph [13] or the probability graph [J], the CEG provides a generalisation of the BN. In
particular two situations $v_{i-1}$ and $v'_{i-1}$ are in the same stage $u_i$ if, and only if, the values of
their parents agree. This fully expresses the conditional independence statement embodied
in the BN.

Topological characteristics of a CEG derived from a discrete BN include: (i) all the
root-to-sink paths have the same length, (ii) the stages consist of situations all of whose
distances (length of the path from the root to the situation) from the root are the same,
and (iii) for $2 \le i \le n$ all stages $u_i$ associated with different configurations of parents of
$X_i$ contain exactly the same number of situations. Examples and d-separation theorems for
CEG's are given in


Example 2.11. Consider the binary BN $X_2 \leftarrow X_1 \rightarrow X_3$. Its CEG is in Figure [H].
Note that the statements that can be read from the topology of this CEG are that the two
situations ($\{X_1 = 0, X_2 = 0\}$, $\{X_1 = 0, X_2 = 1\}$) and the two situations ($\{X_1 = 1, X_2 = 0\}$,
$\{X_1 = 1, X_2 = 1\}$) are in the same stages (respectively $[0, 0] \cup [0, 1]$ and $[1, 0] \cup [1, 1]$). From
the definition of a stage, this means that the two conditional statements

\[
\begin{aligned}
P(X_3 = 1 \mid \{x_1 = 0, x_2 = 0\}) &= P(X_3 = 1 \mid \{x_1 = 0, x_2 = 1\}) \\
P(X_3 = 0 \mid \{x_1 = 0, x_2 = 0\}) &= P(X_3 = 0 \mid \{x_1 = 0, x_2 = 1\})
\end{aligned}
\tag{5}
\]

and

\[
\begin{aligned}
P(X_3 = 1 \mid \{x_1 = 1, x_2 = 0\}) &= P(X_3 = 1 \mid \{x_1 = 1, x_2 = 1\}) \\
P(X_3 = 0 \mid \{x_1 = 1, x_2 = 0\}) &= P(X_3 = 0 \mid \{x_1 = 1, x_2 = 1\})
\end{aligned}
\tag{6}
\]

hold. This is synonymous with the statement $X_3 \perp\!\!\!\perp X_2 \mid X_1$. The CEG in Figure [H] contains the
statement (5) but not necessarily (6), which cannot be expressed using a BN.
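Statements (5) and (6) can be checked numerically for any particular BN of the form $X_2 \leftarrow X_1 \rightarrow X_3$. The sketch below builds the joint distribution from hypothetical conditional probability tables and confirms that, for each value of $x_1$, the distribution of $X_3$ is the same in the two situations that differ only in $x_2$. All numbers and names are assumptions for illustration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional probability tables for the BN X2 <- X1 -> X3.
P_X1 = {0: F(2, 5), 1: F(3, 5)}
P_X2_GIVEN_X1 = {0: {0: F(1, 4), 1: F(3, 4)}, 1: {0: F(2, 3), 1: F(1, 3)}}
P_X3_GIVEN_X1 = {0: {0: F(1, 2), 1: F(1, 2)}, 1: {0: F(1, 5), 1: F(4, 5)}}

JOINT = {
    (x1, x2, x3): P_X1[x1] * P_X2_GIVEN_X1[x1][x2] * P_X3_GIVEN_X1[x1][x3]
    for x1, x2, x3 in product((0, 1), repeat=3)
}


def p_x3_given(x3, x1, x2):
    """P(X3 = x3 | X1 = x1, X2 = x2) computed from the joint table."""
    denom = sum(JOINT[(x1, x2, z)] for z in (0, 1))
    return JOINT[(x1, x2, x3)] / denom


def same_stage(x1):
    """Situations (x1, x2=0) and (x1, x2=1) give X3 the same distribution."""
    return all(p_x3_given(x3, x1, 0) == p_x3_given(x3, x1, 1) for x3 in (0, 1))
```

`same_stage(0)` corresponds to statement (5) and `same_stage(1)` to statement (6). A model in which only the first holds would be a context-specific independence of the kind discussed above.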

3. Manipulation and Causality. 

3.1. Manipulations. A CEG provides a flexible framework for expressing what might
happen were a model to be manipulated in certain ways or made subject to some control. Of
course, as Shafer [25] similarly argues for probability trees, the validity of such a framework
is heavily dependent on context. Some discussions of notions of the manipulation of a
system and intervention, and various applications, can be found in [3, Q, Q, 30]. Here we
follow Pearl [18]: a model for the manipulation is developed and the issue of suitability of
such manipulation for the application under study is left to practical considerations. Recall
briefly the standard definition of the "do"-operator, which is fundamental in [18]. The joint
density function of a set of random variables $X_1, \ldots, X_n$ with sample spaces $\mathbb{X}_1, \ldots, \mathbb{X}_n$
factorises according to a directed acyclic graph (DAG)

n 

(7) p(xi, . . . ,x n ) = Y[p(xi\pa(xi)) 

i=i 

where $pa(x_i)$ are the parents of $X_i$ in DAG language. A random variable is forced to assume a certain value with probability one, say $X_j = \hat{x}_j$ for some $j \in \{1, \ldots, n\}$ and $\hat{x}_j \in \mathcal{X}_j$. A new joint density, $p(\cdot \,\|\, \hat{x}_j)$, is defined on $\{X_1, \ldots, X_n\} \setminus \{X_j\}$ by the formula

(8) $p(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n \,\|\, \hat{x}_j) = \prod_{i=1,\, i \ne j}^{n} p(x_i \mid \hat{pa}(x_i))$

where $\hat{pa}(x_i)$ is the subset of parents of $x_i$ for which $X_j = \hat{x}_j$. This formula expresses the effect of the manipulation $X_j = \hat{x}_j$.
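As a rough illustration of this truncated factorisation, the following Python sketch applies it to the binary BN $X_2 \leftarrow X_1 \rightarrow X_3$ of Example 2.11. The conditional probability tables and the function names `joint`, `do_x1` and `do_x2` are assumptions introduced only for this sketch; they do not come from the paper.

```python
from itertools import product

# Illustrative conditional probability tables for the binary BN X2 <- X1 -> X3.
# Every number here is an assumption made for the sketch.
P_X1 = {0: 0.6, 1: 0.4}
P_X2_GIVEN_X1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [x1][x2]
P_X3_GIVEN_X1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [x1][x3]


def joint(x1, x2, x3):
    """Idle joint density: product of each variable given its parents."""
    return P_X1[x1] * P_X2_GIVEN_X1[x1][x2] * P_X3_GIVEN_X1[x1][x3]


def do_x1(x1_hat):
    """Density of (X2, X3) after forcing X1 = x1_hat.

    The factor of X1 is dropped and the remaining factors are
    evaluated with the parent X1 fixed at x1_hat.
    """
    return {
        (x2, x3): P_X2_GIVEN_X1[x1_hat][x2] * P_X3_GIVEN_X1[x1_hat][x3]
        for x2, x3 in product((0, 1), repeat=2)
    }


def do_x2(x2_hat):
    """Density of (X1, X3) after forcing X2 = x2_hat.

    X2 has no children, so dropping its factor leaves the other
    factors untouched and the result does not depend on x2_hat.
    """
    return {
        (x1, x3): P_X1[x1] * P_X3_GIVEN_X1[x1][x3]
        for x1, x3 in product((0, 1), repeat=2)
    }
```

In this sketch, forcing the root $X_1$ changes the distribution of both children, whereas forcing the leaf $X_2$ leaves the distribution of $(X_1, X_3)$ equal to its idle marginal.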

A manipulation of a probability tree [of a CEG] can be defined in an analogous manner by 
modifying the distributions of some of the random variables sitting on situations [positions]. 

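Definition 3.1. Let $(\mathcal{T}, \Pi(\mathcal{T}))$ be a probability tree model and $D \subset S$ a subset of situations of the tree. A manipulation of the tree is a pair $(D, \hat{\Pi}_D)$ where $\hat{\Pi}_D = \{\hat{\pi}(v' \mid v) : v \in D \text{ and } v' \in \mathbb{X}(v)\}$ and $\{\hat{\pi}(v' \mid v) : v' \in \mathbb{X}(v)\}$ is a new distribution for $X(v)$.

The effect of this manipulation is the transformation $(\mathcal{T}, \Pi) \to (\mathcal{T}, \hat{\Pi})$ in which the probability of the edge from $v$ to $v'$ becomes

$\pi(v' \mid v)$ if $v \notin D$,
$\hat{\pi}(v' \mid v)$ if $v \in D$,

for $v' \in \mathbb{X}(v)$. The manipulated tree is the probability tree model so obtained. The manipulated CEG is the CEG of the manipulated tree.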
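The definition above can be sketched in Python by storing a probability tree as a dictionary from each situation to the distribution over its children. The tree, its probabilities and the helper names `manipulate` and `path_probabilities` are hypothetical and serve only to illustrate how the primitive probabilities of situations in $D$ are replaced while all others are kept.

```python
# Hypothetical probability tree: situation -> {child: primitive probability}.
# Vertices that never appear as keys are leaves.
TREE = {
    "v0": {"v1": 0.5, "v2": 0.5},
    "v1": {"v3": 0.25, "v4": 0.75},
    "v2": {"v5": 0.5, "v6": 0.5},
}


def manipulate(pi, pi_hat_d):
    """Return the manipulated primitive probabilities.

    pi_hat_d maps each situation in D to its new distribution;
    children left out of a new distribution get probability zero.
    """
    manipulated = {v: dict(dist) for v, dist in pi.items()}
    for v, new_dist in pi_hat_d.items():
        if v not in pi:
            raise KeyError(f"{v} is not a situation of the tree")
        if not set(new_dist) <= set(pi[v]):
            raise ValueError(f"new distribution for {v} uses unknown children")
        if abs(sum(new_dist.values()) - 1.0) > 1e-12:
            raise ValueError(f"new distribution for {v} does not sum to one")
        manipulated[v] = {child: new_dist.get(child, 0.0) for child in pi[v]}
    return manipulated


def path_probabilities(pi, root="v0"):
    """Probability of every root-to-leaf path as a product of edge probabilities."""
    paths = {}

    def walk(v, path, prob):
        if v not in pi:  # leaf reached
            paths[path] = prob
            return
        for child, p in pi[v].items():
            walk(child, path + (child,), prob * p)

    walk(root, (root,), 1.0)
    return paths


# Manipulation with D = {v1}: the unit at v1 is sent to v3 with probability one.
MANIPULATED = manipulate(TREE, {"v1": {"v3": 1.0}})
```

Comparing `path_probabilities(TREE)` with `path_probabilities(MANIPULATED)` shows that only paths through `v1` change, which is the behaviour the definition describes.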

Definition 3.1 allows large classes of intervention, some of which are illustrated in Example 3.1.

Example 3.1 (Continuation of Example 2.1). (1.) Fix some values of the primitive
probabilities or of a function of them, e.g. tt\ = 1, 7T3 + W4 = 0.5, P4 = tx\-k^ = 0.5. (2.) 
Impose (polynomial) constraints on primitive probabilities, e.g. 713 = 2tt4, tti = tt^ = ttq. 
(3.) Assume that the distribution of the random variable sitting on some situation is from a parametric family, e.g. $X(v_1)$ follows a Binomial distribution with $P(X(v_1) = v_3) = s^2$, $P(X(v_1) = v_4) = 2s(1 - s)$ and $P(X(v_1) = v_5) = (1 - s)^2$ for $s \in [0, 1]$. (4.) In the idle tree $X(v_1)$ follows a Binomial distribution with $\pi_3 = s^2$ and $\pi_5 = (1 - s)^2$ and in the manipulated tree $X(v_1)$ has a uniform distribution with $\pi_3 = \pi_4 = \pi_5 = 1/3$.

Here the distinction between intervention and constraint is not a mathematical one. A probability tree model is assigned together with extra information like its stages or some logical constraints (sum-to-one) or experimental regimes (women are randomly assigned to treatment B or C in Example 2.3). These pieces of information impose some constraints on the primitive probabilities. Then, the system is modified by changing one or more of the distributions of the random variables on situations. There might be no values of the primitive probabilities that satisfy both the manipulated tree and the idle tree, as in the case of Example 3.1 (4). Next we define classes of CEG models and of manipulations for which it is plausible to investigate whether the set of values of the primitive probabilities that satisfy both the idle and the manipulated CEG is not empty.

If two unmanipulated situations were in the same stage of the original tree and are 
manipulated in the same way, then they remain in the same stage in the manipulated tree. 

Definition 3.2. A manipulation is called positioned if the partition of positions after 
the manipulation is equal to or a coarsening of the partition before manipulation. It is called 
staged if the partition of stages after the manipulation is equal to or a coarsening of the 
partition before manipulation. 

Example 3.2 (Continuation of Example 2.6). Let $v_1$ and $v_2$ be in the same stage with $v_3$ mapping into $v_5$. The manipulation $D = \{v_1, v_2\}$ and $P(X(v_1) = v_3) = 1$, $P(X(v_2) = v_5) = 1$ is a staged and a positioned manipulation. The CEG of the manipulated tree is in Figure [15] where edges with zero probabilities are not drawn.

Example 3.3. In the probability tree in Figure [10] the staged manipulation defined by $\hat{\pi}(v_5 \mid v_1) = 1$, $\hat{\pi}(v_9 \mid v_3) = 1$, $\hat{\pi}(v_{17} \mid v_{13}) = 1$, $\hat{\pi}(v_{19} \mid v_{17}) = 1$, leads to a modification of the CEG in Figure [11] in which the edges into $w_\infty$ from the positions $[v_1, v_3]$, $[v_{13}]$ and $[v_{17}]$ need not be drawn because the associated manipulated probabilities become zero. Indeed this manipulation could be performed directly on the CEG by removing the three aforementioned edges. This idea is developed in Section 3.2.

Example 3.4 (Continuation of Example 3.3). The staged and positioned manipulation with $D = \{v_2, v_7\}$ and $\hat{\pi}(v_6 \mid v_2) = 1$, $\hat{\pi}(v_{12} \mid v_7) = 1$ has the effect of cutting off the branch starting at $v_2$ and going through $v_7$. Figure [16] gives the CEG of the manipulated tree.

A positioned manipulation manipulates all sample units identically when their future 
development distributions are identical, using the same (possibly randomising) allocation 
rule. A staged manipulation will treat sample units identically if their next development in 
the idle system is the same. Our experience has been that it is often sufficient to restrict 
study to positioned manipulations. We note for example that all manipulations on a BN 
considered by Pearl are positioned and also staged. Example 3.5 gives a simple case when
a staged manipulation is not appropriate but a positioned manipulation is.

Example 3.5. An English university has residence blocks of apartments with two rooms 
each. It allocates prospective second year students (either English (E) or Chinese (C)) to one 
of the two rooms of each apartment. The second room has to be allocated to a prospective 
first year student. In the past this has been done at random: that is, each of the $N$ second year students to go into an apartment and each of the $N$ first year students is allocated a distinct integer $1 \le i \le N$ using a randomization device, and students share with the student allocated the same integer. However it has been noticed in a survey that the
probability of satisfaction of home students placed with home students is higher and of 
Chinese students placed with Chinese students is higher than when they are mixed. In 
order to cause students' satisfaction to increase, the university decides to place first year 
students with a second year student with the same ethnicity. 

The BN and CEG of this problem are given in Figure [17] where $X$ represents the ethnicity of the second year student, $Y$ that of the first year student and $Z$ is a binary index of the satisfaction of two students in the same apartment, taking values U and S. Thus for example $X(v_0) = X$, $X(v_1) = [Y \mid X = E]$, $X(v_3) = [Z \mid X = E, Y = E]$ and $\pi(v_5 \mid v_2)$ gives the probability of allocating a Chinese first year student to an apartment with a Chinese second year student. The vertices $v_3$ and $v_5$ are in the same stage to indicate a non-mixed apartment; the stage $\{v_4, v_6\}$ has an analogous interpretation. The undirected edge between $v_1$ and $v_2$ represents the random allocation of the first year student to an apartment.

The relationship between satisfaction and shared race is not depicted in the BN whilst it is in the CEG through the colouring of its edges. More significantly it is impossible to determine, either from the semantics of the BN or the factorisation of the probability mass function of the path events, whether the allocation of the prospective second year student occurs before the allocation of the prospective first year student. The CEG states explicitly that second year allocation occurs before first year allocation, so that "causal" manipulation of the type suggested by the survey above is a possibility. The semantics of a BN are not refined enough to represent the sort of manipulation considered in this example. Note that central to the BN analysis of causal relationships is the absence of edges between vertices, here $X$ and $Y$ (see e.g. [6]). The only way to embody the types of manipulation we consider here is to join $X$ and $Y$ by an edge and so lose this intrinsic information.

A manipulation that forces individuals of the same ethnicity to share an apartment implies a CEG without the undirected edge between $v_1$ and $v_2$ and without the crossing arrows in the CEG in Figure [17].

3.2. Manipulating CEG's. The standard manipulations of a BN are those that force some components of the network to take pre-assigned values, as in Equation (8). The analogue for the CEG is to consider manipulations which force all the paths to pass through an identified set of positions $W$. An example is the assignment of a particular type of unit, here described by their current position, to a particular treatment regime, here described
by a set of subsequent positions W (see also Section R~2j) . 

For a CEG $C$ and a set of positions $W$ in $C$, let $pa(W)$ denote the set of all parents of the elements in $W$, that is $pa(W) = \{w^* \in V(C) : \text{there exists } w \in W \text{ such that } (w^*, w) \in E_d(C)\}$. In the analogy above the set $pa(W)$ corresponds to the positions any unit must reach to be submitted to a treatment forcing them into the positions in $W$.

imsart-aos ver. 2005/10/19 file: currentRevAnnals.tex date: February 2, 2008 




Definition 3.3. A subset $W$ of positions of a CEG $C$ is called a manipulation set if 1. all root-to-sink paths in $C$ pass through exactly one position in $pa(W)$, and 2. each position in $pa(W)$ has exactly one child in $W$.
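The two conditions of this definition can be checked mechanically on a small graph. The sketch below assumes a CEG stored as a dictionary from each position to the list of its children; the graphs `CEG_A` and `CEG_B` and the function names are hypothetical. It applies the two conditions literally, so the trivial root-only case mentioned in the text is not special-cased.

```python
def root_to_sink_paths(edges, node, sink):
    """Yield every directed path from node to sink as a tuple of positions."""
    if node == sink:
        yield (node,)
        return
    for child in edges[node]:
        for tail in root_to_sink_paths(edges, child, sink):
            yield (node,) + tail


def is_manipulation_set(edges, positions, root, sink):
    """Check conditions 1 and 2 of the definition for the set `positions`."""
    w_set = set(positions)
    # pa(W): positions with at least one child in W.
    parents = {u for u, children in edges.items() if w_set & set(children)}
    # Condition 1: every root-to-sink path meets pa(W) exactly once.
    cond1 = all(
        sum(v in parents for v in path) == 1
        for path in root_to_sink_paths(edges, root, sink)
    )
    # Condition 2: every position in pa(W) has exactly one child in W.
    cond2 = all(len(w_set & set(edges[u])) == 1 for u in parents)
    return cond1 and cond2


# Hypothetical CEGs used only for illustration.
CEG_A = {
    "w0": ["a", "b"],
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["w_inf"],
    "d": ["w_inf"],
}
CEG_B = {"w0": ["a", "w_inf"], "a": ["c"], "c": ["w_inf"]}
```

In `CEG_A` the set `{"c"}` satisfies both conditions, `{"c", "d"}` fails condition 2, and in `CEG_B` the set `{"c"}` fails condition 1 because one path bypasses its parent. Path enumeration is exponential in general, so this is only suitable for small examples.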

Example 3.6. In the CEG in Figure [15] the position $[v_1, v_2]$ is a manipulation set. Excluding the trivial case of a manipulation set consisting of the root node only, there is no manipulation set in the CEG in Figure [11]. In the CEG in Figure [12] $W = \{[v_1], [v_2]\}$ is not a manipulation set. The manipulation described in Example 3.5 is to a manipulation set.

Definition 3.4. A manipulation $(D, \hat{\Pi}_D)$ of a CEG is called a pure manipulation to the positions $W$ if

1. it is a positioned manipulation,

2. for each $v \in D$ there exists $w \in W$ such that $P(X(v \mid D) = w) = \hat{\pi}(w \mid v) = 1$, and

3. no $v \notin D$ is manipulated.

For a CEG to be causal for an application, (i) it needs to be valid for that application and (ii) for the pure manipulations to any manipulation set the corresponding manipulated CEG must also be valid. If a CEG admits a description as a BN and the CEG is causal then the BN is also causal in the sense of [18, Definition 1.3.1]. In this sense a causal CEG is a natural generalisation of a causal BN, applicable to asymmetric models. However it can express a larger variety of manipulations than a causal BN: for example those based on certain functions of preceding variables as in Example 3.5. Next, manipulations on the CEG for Example 1.2 are discussed.

Example 3.7 (Continuation of Example 1.2). Manipulation 1. Consider the manipulation forced to $w_1$, which can be read as ensuring the suspect is taken to court. This would assign probability 1 to the edge labelled $x_1$; all vertices other than $w_\infty$ on paths after $w_2$ and their associated edges are deleted, and the primitive probabilities of the manipulated CEG are as in the unmanipulated CEG except for the edge labelled $x_1$. The manipulated
CEG is in Figure [2j Notice here that the hypothesis that this new CEG is valid for the 
manipulated system is a substantive one and in particular will depend upon how we plan to implement the manipulation. For example, it might hold if whether the suspect went to court simply depended on whether a judge could be found or on government policy. But if we choose not to proceed with prosecution because, on the basis of their evidence, a third party did not think the case was strong enough to convince a jury, then this causal deduction would
almost certainly not be valid. Manipulation 2. Consider the manipulation forced to Wis, 
signifying e.g. that the witness does not identify the suspect. The associated causal CEG is 
given in Figure [2j Note that the inevitable consequence of this manipulation is that suspect 
is released as there is just the release edge R into $w_\infty$.

4. Identifying effects of a manipulation. 



4.1. Identification of causal effects. Recent papers in the causal BN literature study when the topology of a BN helps to deduce that the effect of a manipulation on a pre-specified node of the BN can be identified from observing a subset of the BN variables that are observed or "manifest" in an unmanipulated system. Experiments on the original "idle" system can then be designed so that the effects of, for example, a proposed new treatment regime on a manipulated system can be established. Here we demonstrate that the topology of the CEG can also be used to find functions of the data, for example subsets of possible measurements, that when observed in the idle system allow us to estimate effects of a given manipulation. We first need some definitions.
4.1.1. Random variables. Consider $(\mathcal{T}, \Pi(\mathcal{T}))$ with sample space $\mathbb{X}$. Any function $Y : \mathbb{X} \to \mathbb{R}$ is a random variable on $(\mathbb{X}, 2^{\mathbb{X}}, P)$ and defines a partition of the atomic event set, namely

(10) $\Lambda_y = \{\lambda \in \mathbb{X} \text{ such that } Y(\lambda) = y\}$

where $y$ ranges over the image of $Y$, written as Image($Y$), which is a finite set. Similarly, given a partition of $\mathbb{X}$, a random variable $Y$ with finite range space can be constructed so that (10) holds.

Now consider a CEG $C$ constructed from the tree. Let $\mathbb{X}_C$ be the set of root-to-sink paths in the CEG formed by directed edges. As already mentioned, $\mathbb{X}_C$ can be identified with $\mathbb{X}$ and the $\sigma$-algebra $2^{\mathbb{X}}$ on the tree is mapped by the CEG construction into the power set of $\mathbb{X}_C$, call it $2^{\mathbb{X}_C}$. Furthermore, the random variable $Y$ corresponds to a random variable on $(\mathbb{X}_C, 2^{\mathbb{X}_C})$ that induces on $\mathbb{X}_C$ the same partition as the one obtained by mapping the $\Lambda_y$ sets, $y \in$ Image($Y$), on the CEG. Thus, with a slight abuse of notation, for $y \in$ Image($Y$) we can write $\Lambda_y = \{\lambda \in \mathbb{X} \text{ such that } Y(\lambda) = y\} = \{\lambda \in \mathbb{X}_C \text{ such that } Y(\lambda) = y\}$.

Definition 4.1. A random variable $Y$ on $\mathbb{X}$ is called observed (or manifest) if, and only if, indicators of the events $\Lambda_y$ are observed or observable for all $y \in$ Image($Y$).

Although in practice vectors of manifest random variables are likely to occur, here, 
without loss of generality, we can work with random variables, because a partition induced 
by a random vector can be induced by a uni-dimensional random variable. 

4.2. Manipulation forced to a position. 

Definition 4.2. Call a manipulation of a CEG $C$ forced to the position $w$ if after manipulation

1. the probability of the event $\{w\} = \{\lambda \in \mathbb{X} : w \in \lambda\}$ is one, and

2. all primitive probabilities in the manipulated CEG associated with positions at or after $w$ in the original CEG are those in the idle system.

Example 4.1. If $\{w\}$ is a manipulation set of $C$ then the pure manipulation to $w$ is a manipulation forced to $w$.

Example 4.2. In Example 3.5 a manipulation forced to $w = \{v_3, v_5\}$ is obtained by setting $\hat{\pi}(v_3 \mid v_1) = 1 = \hat{\pi}(v_5 \mid v_2)$, that is by allocating students with the same ethnicity to the same apartment. This manipulation directly on the CEG is given by $\hat{\pi}(\{v_4, v_6\} \mid \{v_1\}) = \hat{\pi}(\{v_4, v_6\} \mid \{v_2\}) = 0$.

Example 4.3. In Example 2.4 the manipulation m = 1, forcing a woman who wants
to have a child to take medicine C, is trivially a positioned manipulation, but not a staged 
manipulation. It is also a forced manipulation to w = {vj}. 

Consider a position $w$ in the CEG $C$. Every directed root-to-sink path through $w$ can be split into two parts: one from the root to $w$ and one from $w$ to the sink node, $w_\infty$. Thus for $\lambda \in \mathbb{X}_C$ if $w \in \lambda$ we can write $\lambda = \lambda(w_0, w) \times \lambda(w, w_\infty)$, where $\times$ indicates concatenation of paths. Note that $\{\lambda(w, w_\infty) : \lambda \in \mathbb{X}_C\}$ is the set of root-to-sink paths in the sub-CEG of $C$ starting at $w$, namely the CEG whose root is $w$, whose vertex set $V$ is formed by positions in $C$ lying on paths from $w$ to $w_\infty$ and whose edges are those in $C$ connecting elements in $V$, likewise its edge labels. Call it $C(w)$.

Consider a random variable $Y(w)$ on $(C(w), 2^{C(w)})$ and let zero be one of the values not taken by $Y(w)$. Let $\{\Lambda_y^+(w) : y \in \text{Image}(Y(w))\}$ be the partition induced by $Y(w)$ on $\{\lambda(w, w_\infty) : \lambda \in \mathbb{X}_C\}$. It can be extended to a partition of $\mathbb{X}_C$ with sets $\Lambda_0(w) = \{\lambda \in \mathbb{X}_C : w \notin \lambda\}$ and $\Lambda_y(w) = \{\lambda = \lambda(w_0, w) \times \lambda(w, w_\infty) : \lambda(w, w_\infty) \in \Lambda_y^+(w)\}$ for $y \in \text{Image}(Y(w))$. Recall that there exists one, usually many, random variables $\tilde{Y}(w)$ on $(\mathbb{X}_C, 2^{\mathbb{X}_C})$ which induce this partition. In Lemma 4.1 we compare the distributions of any such $\tilde{Y}(w)$ and of $Y(w)$ before and after a manipulation forced to $w$.

Lemma 4.1. Consider a manipulation forced to $w$ in the CEG $(C, \Pi)$. Let $P$ be the probability measure on the unmanipulated CEG and $\hat{P}$ the probability measure on the manipulated CEG, $(C, \hat{\Pi})$. Let $\{w\}$ be the event of passing through $w$ in the idle system. Let $Y(w)$ and $\tilde{Y}(w)$ be defined as above. Then for $y \in \text{Image}(Y(w))$

1. $\hat{P}(Y(w) = y) = P(Y(w) = y)$,

2. $P(\tilde{Y}(w) = 0) = 1 - P(\{w\})$ and $\hat{P}(\tilde{Y}(w) = 0) = 0$,

3. $P(\tilde{Y}(w) = y) = P(\{w\}) \, P(Y(w) = y)$, and

4. $\hat{P}(\tilde{Y}(w) = y) = P(Y(w) = y)$.

Proof. 1. This follows from the fact that the primitive probabilities on $C(w)$ are not changed by the manipulation.

2. The probability of not passing through $w$ in the manipulated system is zero because the manipulation is forced to $w$, while in the idle system it is clearly one minus the probability of passing through $w$.

3. This is by construction of $\tilde{Y}(w)$ from $Y(w)$. A path $\lambda$ through $w$ is decomposed as $\lambda = \lambda(w_0, w) \times \lambda(w, w_\infty)$. By Equation (1) its probability in the idle system is $P(\lambda) = P(\lambda(w_0, w)) \, P(\lambda(w, w_\infty))$. Now, $P(\lambda(w_0, w))$ is the probability of reaching $w$ and $P(\lambda(w, w_\infty)) = P(Y(w) = y)$ by construction of $\tilde{Y}(w)$ from $Y(w)$ and the fact that the manipulation forced to $w$ does not change the primitive probabilities for edges after $w$.

4. In the manipulated system $\hat{P}(\{w\}) = 1$ and use Item 1.

□

Item 3 in Lemma 4.1 can be re-written as a conditional probability

$P(\tilde{Y}(w) = y \mid \{w\}) = \frac{P(\tilde{Y}(w) = y)}{P(\{w\})} = P(Y(w) = y)$.

Lemma 4.1 can be applied if there is a position $w$ such that, after enacting a manipulation, all paths pass through $w$ and if the fact that $\{w\}$ occurs can be learnt from a set of measurements in the unmanipulated CEG. This can be checked from the CEG topology.
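The identities of Lemma 4.1 can be checked numerically on a toy example. In the sketch below the CEG, its edge labels and probabilities, and the helper names `paths` and `extended_distribution` are all assumptions made for illustration. The variable attached to $w$ is taken to be the label of the edge leaving $w$, and its extension to all root-to-sink paths takes the value 0 on paths that avoid $w$.

```python
# Hypothetical idle CEG: position -> list of (edge label, child, probability).
IDLE = {
    "w0": [("to_w", "w", 0.3), ("to_u", "u", 0.7)],
    "w": [("y1", "w_inf", 0.6), ("y2", "w_inf", 0.4)],
    "u": [("y1", "w_inf", 0.5), ("y2", "w_inf", 0.5)],
}

# Manipulation forced to w: the root edge into w gets probability one,
# probabilities at and after w are left as in the idle system.
FORCED = dict(IDLE)
FORCED["w0"] = [("to_w", "w", 1.0), ("to_u", "u", 0.0)]


def paths(ceg, node="w0", sink="w_inf"):
    """Yield (positions, edge labels, probability) for each path from node to sink."""
    if node == sink:
        yield (node,), (), 1.0
        return
    for label, child, p in ceg[node]:
        for nodes, labels, q in paths(ceg, child, sink):
            yield (node,) + nodes, (label,) + labels, p * q


def extended_distribution(ceg, w):
    """Distribution of the extended variable: 0 off w, else the label leaving w."""
    dist = {}
    for nodes, labels, p in paths(ceg):
        # labels[i] is the edge leaving nodes[i].
        y = labels[nodes.index(w)] if w in nodes else 0
        dist[y] = dist.get(y, 0.0) + p
    return dist


IDLE_DIST = extended_distribution(IDLE, "w")
FORCED_DIST = extended_distribution(FORCED, "w")
P_W = 1.0 - IDLE_DIST[0]  # probability of passing through w in the idle system
```

With these numbers, dividing each idle probability by `P_W` reproduces the manipulated distribution, which is the conditional-probability reading of Item 3 combined with Item 4.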

4.3. Manipulation forced to a set of positions: the sub-CEG $C(W)$. It is not always possible, even in models that can be described by a causal BN, to observe indicators on the events $\{\Lambda_y(w) : y \in \text{Image}(Y)\}$ for a suitable choice of $w$ and Image($Y$). Nevertheless being able to observe indicators of the set of coarser events $\Lambda_y(W) = \bigcup_{w \in W} \Lambda_y(w)$ for a set of positions, $W$, can also be sufficient for identifiability. To show this is less straightforward although the general set-up is a generalisation of Section 4.2.

Definition 4.3. A set of positions W of a CEG C is called C-regular if no two positions 
in W lie on the same directed path of C. 

Example 4.4. By definition, a manipulation set of C is always C-regular. 

The analogue of the sub-CEG starting at $w$ for a $C$-regular set of positions, $W$, is a new CEG constructed by joining the sub-CEG's starting at each $w \in W$ to a new root-vertex $w_0$. The new edge $(w_0, w)$ is labelled $P(X(w_0) = w) = P(\{w\}) / P(W)$, where, as before, $P(\{w\}) = \sum_{\lambda \in \mathbb{X} : w \in \lambda} P(\lambda)$ is the probability of passing through $w$ in the original CEG and $P(W) = \sum_{w \in W} \sum_{\lambda \in \mathbb{X} : w \in \lambda} P(\lambda)$ is the probability of passing through a position in the set $W$ in the original CEG. Note that because $W$ is $C$-regular, $\sum_{w \in W} P(X(w_0) = w) = 1$. Let $C(W)$ be this new CEG.
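A minimal sketch of this construction follows, assuming the CEG is stored as a dictionary from each position to a dictionary of child probabilities (parallel edges between the same pair of positions are assumed to have been summed). The example graph and the function names are hypothetical. The code first checks $C$-regularity by reachability and then computes the labels $P(\{w\})/P(W)$ of the new root edges.

```python
def reachable(edges, start):
    """Set of positions strictly downstream of `start`."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        for child in edges.get(v, {}):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_regular(edges, positions):
    """True if no two positions of the set lie on the same directed path."""
    w_set = set(positions)
    return all(not (reachable(edges, w) & (w_set - {w})) for w in w_set)


def passing_probability(edges, root, w):
    """P({w}): total probability of the root-to-sink paths through w."""
    if root == w:
        return 1.0
    return sum(
        p * passing_probability(edges, child, w)
        for child, p in edges.get(root, {}).items()
    )


def root_edge_labels(edges, root, positions):
    """Labels P({w}) / P(W) of the edges from the new root to each w in W."""
    if not is_regular(edges, positions):
        raise ValueError("the set of positions is not regular")
    p = {w: passing_probability(edges, root, w) for w in positions}
    total = sum(p.values())  # P(W); the events {w} are disjoint by regularity
    return {w: p_w / total for w, p_w in p.items()}


# Hypothetical CEG used only for illustration.
CEG = {
    "w0": {"a": 0.2, "b": 0.3, "c": 0.5},
    "a": {"w_inf": 1.0},
    "b": {"w_inf": 1.0},
    "c": {"w_inf": 1.0},
}
LABELS = root_edge_labels(CEG, "w0", {"a", "b"})
```

Here `LABELS` assigns 0.4 to the edge into `a` and 0.6 to the edge into `b`, so the new root distribution sums to one as noted in the text.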

Example 4.5. The manipulation in Example 3.3 is a manipulation forced to the set of positions $W = \{\{v_1, v_3\}, \{v_2\}\}$ and $w_0 = \{v_0\}$. A manipulation forced to $W = \{\{v_1, v_3\}, \{v_7\}\}$ requires a new $w_0$ and the edge $(w_0, \{v_7\})$ has the same probability as the edge $(\{v_0\}, \{v_2\})$ in the original CEG.



It can be shown in analogy to Section 4.2 that the partition induced by a random variable $Y(W)$ on $(\mathbb{X}_{C(W)}, 2^{\mathbb{X}_{C(W)}})$ can be extended to a partition on $C$ and this, in turn, can be interpreted as the range space of a random variable on $(\mathbb{X}_C, 2^{\mathbb{X}_C})$.

4.4. Manipulation forced to a set of positions: amenable positions. Let $C^*(w)$ denote a graph representing what happens until we reach a given position $w$. Its vertices and edges are those along the root-to-$w$ paths in $C$ and its edge probabilities are inherited from $C$ as well. The graph $C^*(w)$ is not necessarily a CEG, for example in Figure [11] the graph $C^*([v_7])$
is not a CEG because of the undirected edge between {t^} and {^7}. Write K(C*(w)) for 
the set of positions in $C$ whose vertices are in $C^*(w)$ excluding $w$. For any $C$-regular set of positions, $W$, let $K(C^*(W)) = \bigcup_{w \in W} K(C^*(w))$.

Definition 4.4. Call a set of positions, $W$, simple if

1. $W$ is $C$-regular,

2. there exists a partition of $K(C^*(W))$ into $K^\alpha(C^*(W))$ and $K^\beta(C^*(W))$ called active and background positions respectively such that

(a) for each $w \in W$ an active position has exactly one emanating edge along each root-to-$w$ path in $C^*(w)$ if it lies on that path. Furthermore for any two positions $w_1, w_2 \in W$ and every pair of root-to-$w_1$ path and root-to-$w_2$ path containing the ordered sequences of active positions $w_1^{\alpha,1}, w_1^{\alpha,2}, \ldots, w_1^{\alpha,n}$ and $w_2^{\alpha,1}, w_2^{\alpha,2}, \ldots, w_2^{\alpha,n}$ respectively, the pairs of positions $\{(w_1^{\alpha,k}, w_2^{\alpha,k}) : 1 \le k \le n\}$ are in the same stage. If two active positions lie in the same stage in $C^*(w)$ then their unique emanating edge is labelled by the probability of the same edge in $C$.

(b) Each background position in $C^*(w)$ inherits a complete set of emanating edges from $C$. Furthermore for any two positions $w_1, w_2 \in W$ and every pair of root-to-$w_1$ path and root-to-$w_2$ path containing the ordered sequences of background positions $w_1^{\beta,1}, w_1^{\beta,2}, \ldots, w_1^{\beta,n}$ and $w_2^{\beta,1}, w_2^{\beta,2}, \ldots, w_2^{\beta,n}$ respectively, the pairs of positions $\{(w_1^{\beta,k}, w_2^{\beta,k}) : 1 \le k \le n\}$ are in the same stage.

Note that active positions act as labels for the elements of $W$ because each active position is in at most one path through $w \in W$. Root-to-leaf paths through background positions are governed by the same probability law regardless of the index of $w \in W$.

Example 4.6. A CEG of the binary CBN given by

$B \rightarrow X \rightarrow Y \leftarrow A$

where $X$ is manipulated to the value 1 is given by

[Diagram: from the root $w_0 = w_0^{\alpha,1}$ the edge $A = 1$ leads to $w_1^{\beta,1}$ and the edge $A = 0$ leads to $w_2^{\beta,1}$; from each $w_i^{\beta,1}$ the two edges $B = 0$ and $B = 1$ lead to $w_i^{\alpha,2}$; from each $w_i^{\alpha,2}$ the single edge $X = 1$ leads to $w_i$; both $w_1$ and $w_2$ lead to $w_\infty$.]

where the manipulation set is $W = \{w_1, w_2\}$. The active and background positions can be identified from $C(W)$ thinking of $C(W)$ as a subgraph of $C$ with edge probabilities inherited from $C$. In this example the active positions are $\{w_0, w_1^{\alpha,2}, w_2^{\alpha,2}\}$ where $\{w_0, w_1^{\alpha,2}\} = \{A = 1\}$ and $\{w_0, w_2^{\alpha,2}\} = \{A = 0\}$ label the two positions in $W$, and $w_1^{\alpha,2}, w_2^{\alpha,2}$ lie at the same point in their respective root-to-$w_1$ and root-to-$w_2$ sequences and so lie in the same position. The background positions are $w_1^{\beta,1}$ and $w_2^{\beta,1}$; they retain all their emanating edges from the original CEG and are also in the same position. The terminology of active and background position here means that active positions might subsequently affect $Y$ if manipulated after $X$ while subsequently manipulating background positions cannot affect $Y$.
Further examples are in Section 5.

Definition 4.5. A manipulation is called amenable forcing to a set $W$ if

1. the set $W$ is simple in $(C, \Pi(C))$,

2. the set $W$ is simple in $(C, \hat{\Pi}(C))$ and $\hat{P}(W) = 1$, and

3. $\Pi(C)$ and $\hat{\Pi}(C)$ differ only on edges whose parents lie in $K^\alpha(C^*(W))$.

Item 2 in Definition 4.5 assumes that the manipulation is forced to $W$ as the probability of passing through a vertex not in $W$ in the manipulated system is zero. Furthermore, an amenable manipulation may change probabilities on edges associated to active positions, but will always leave probabilities associated with background positions unchanged.

Example 4.7. When W = {w} is a singleton, the set of active positions will be empty 
and so all the conditions above are vacuous and W is simple. It follows that a pure manip- 
ulation forced to w is amenable. 

Lemma 4.2. Consider an amenable manipulation forcing to a simple set $W$. Then for $w \in W$

$P(\{w\}) = \pi_W \, P^\beta(\{w\})$ and $\hat{P}(\{w\}) = \hat{\pi}_W \, P^\beta(\{w\})$

where $P(\{w\})$ is the probability that a path in $C$ will pass through the position $w \in W$ in the idle $(C, \Pi(C))$ and $\hat{P}(\{w\})$ is the analogue probability in the manipulated system; and $\pi_W$ and $P^\beta(\{w\})$ are products of primitive probabilities in $\Pi(C)$ associated with random variables whose positions lie in $K^\alpha(C^*(W))$ and $K^\beta(C^*(W))$, respectively.



Proof. Item 2 in Definition 4.4 means that events associated with background positions and with active positions are independent. Thus, for each $w \in W$ we have $P(\{w\}) = P^\alpha(\{w\}) \, P^\beta(\{w\})$ where $P^\beta(\{w\})$ is defined above and $P^\alpha(\{w\})$ is the product of primitive probabilities in $\Pi(C)$ associated with random variables whose positions lie in $K^\alpha(C^*(W))$. Furthermore, from the definition of $K^\alpha(C^*(W))$ for any positions $w, w' \in W$ we have $P^\alpha(\{w\}) = P^\alpha(\{w'\}) = \pi_W$ (say).

The fact that $W$ is also simple in $(C, \hat{\Pi}(C))$ for the amenable manipulation implies that $\hat{P}(\{w\}) = \hat{\pi}_W \, \hat{P}^\beta(\{w\})$ for all $w \in W$. Finally from Item 1 in Definition 4.5 we have that $P(\{w\}) = \pi_W \, P^\beta(\{w\})$ and from Item 2 $\hat{P}(\{w\}) = \hat{\pi}_W \, P^\beta(\{w\})$. □

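Lemma 4.3. Consider an amenable manipulation forcing to a simple set $W$. The distribution of a random variable $Y(W)$ defined on the sub-CEG $C(W)$ is identified from the probabilities in the unmanipulated system of the events $\{\tilde{Y}(W) = y, W\}$ for $y \in \text{Image}(Y)$, where $\tilde{Y}(W)$ is constructed as above, and its probabilities are given by the equation

$\hat{P}(Y(W) = y) = P(\tilde{Y}(W) = y \mid W)$

provided that $P(\{w\}) > 0$ for all $w \in W$.

Proof. Clearly for $y \in \text{Image}(Y)$

$P(\tilde{Y}(W) = y \mid W) = \frac{P(\tilde{Y}(W) = y, W)}{P(W)}$

where $P(W) = \sum_{w \in W} P(\{w\}) = \pi_W \sum_{w \in W} P^\beta(\{w\})$ by Lemma 4.2. Analogously

$\hat{P}(\tilde{Y}(W) = y \mid W) = \frac{\hat{P}(\tilde{Y}(W) = y, W)}{\hat{P}(W)}$

where $\hat{P}(W) = 1$ as the manipulation is forced to $W$ and by Lemma 4.2 $\hat{P}(W) = \sum_{w \in W} \hat{P}(\{w\}) = \hat{\pi}_W \sum_{w \in W} P^\beta(\{w\})$. Furthermore

$\hat{P}(Y(W) = y) = \sum_{w \in W} \hat{P}(Y(W) = y \mid \{w\}) \, \hat{P}(\{w\})$

$= \sum_{w \in W} \hat{\pi}_W \, P^\beta(\{w\}) \, \hat{P}(Y(W) = y \mid \{w\})$   by Lemma 4.2

$= \frac{\hat{\pi}_W}{\pi_W} \sum_{w \in W} \pi_W \, P^\beta(\{w\}) \, \hat{P}(Y(W) = y \mid \{w\})$   multiply and divide by $\pi_W$

$= \frac{\hat{\pi}_W}{\pi_W} \sum_{w \in W} P(\{w\}) \, \hat{P}(Y(W) = y \mid \{w\})$.

Set $\{Y(W) = y, \{w\}\} = \{Y(w) = y\}$ and use Lemma 4.1 Item 1 to obtain

$\hat{P}(Y(W) = y) = \frac{\hat{\pi}_W}{\pi_W} \sum_{w \in W} P(\{w\}) \, P(Y(w) = y)$.

As for $w \in W$ it holds that $P(\{w\}) = P(\{w\} \mid W) \, P(W)$, then

$\hat{P}(Y(W) = y) = \frac{\hat{\pi}_W \, P(W)}{\pi_W} \sum_{w \in W} P(\{w\} \mid W) \, P(\tilde{Y}(W) = y \mid \{w\}) = \frac{\hat{\pi}_W \, P(W)}{\pi_W} \, P(\tilde{Y}(W) = y \mid W)$.

Thus we have that $\hat{P}(Y(W) = y)$ is proportional to $P(\tilde{Y}(W) = y \mid W)$ and, being probabilities, they must be equal. □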

5. A back-door theorem for CEG's. In [18] sufficient topological conditions (see Definition 3.3.1 page 79) on a causal BN are given for when the probability of a random variable $Y$ on the BN is governed by the formula

$P_x(Y = y) = \sum_{z \in \mathcal{Z}} P(Y = y \mid X = x, Z = z) \, P(Z = z)$

after $X$ has been manipulated to take a value $x$. Here $Z$ is a random vector of variables appearing as vertices in the BN and $\mathcal{Z}$ is its sample space. A particular consequence of this formula is that $P_x(Y)$ can be calculated as a function of the marginal probability distribution of $(X, Y, Z)$ and we do not need to observe any other variable in the system.
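The adjustment formula can be evaluated directly from a joint table of $(X, Y, Z)$. The sketch below builds such a table from hypothetical probabilities (all numbers and function names are assumptions for illustration) and contrasts the adjusted quantity with the ordinary conditional probability $P(Y = y \mid X = x)$.

```python
# Hypothetical ingredients for a binary example; every number is an assumption.
P_Z = {0: 0.5, 1: 0.5}
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}                 # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,      # P(Y = 1 | X = x, Z = z),
                 (0, 1): 0.5, (1, 1): 0.7}      # keyed by (x, z)


def build_joint():
    """Joint table P(X = x, Y = y, Z = z) keyed by (x, y, z)."""
    joint = {}
    for z in (0, 1):
        for x in (0, 1):
            p_x = P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]
            for y in (0, 1):
                p_y1 = P_Y1_GIVEN_XZ[(x, z)]
                p_y = p_y1 if y == 1 else 1.0 - p_y1
                joint[(x, y, z)] = P_Z[z] * p_x * p_y
    return joint


def adjusted(joint, x, y):
    """Sum over z of P(Y = y | X = x, Z = z) P(Z = z)."""
    total = 0.0
    for z in {key[2] for key in joint}:
        p_z = sum(p for (_, _, zz), p in joint.items() if zz == z)
        p_xz = sum(p for (xx, _, zz), p in joint.items() if xx == x and zz == z)
        if p_xz == 0.0:
            raise ValueError("P(X = x, Z = z) must be positive for every z")
        total += joint.get((x, y, z), 0.0) / p_xz * p_z
    return total


def conditional(joint, x, y):
    """Ordinary conditional probability P(Y = y | X = x)."""
    p_x = sum(p for (xx, _, _), p in joint.items() if xx == x)
    p_xy = sum(p for (xx, yy, _), p in joint.items() if xx == x and yy == y)
    return p_xy / p_x


JOINT = build_joint()
```

With these numbers `adjusted(JOINT, 1, 1)` is 0.5 whereas `conditional(JOINT, 1, 1)` is 0.62, which illustrates why the adjustment over $Z$ matters; only the marginal table of $(X, Y, Z)$ is used in either computation.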

Next an analogue formula is derived for CEG's. Let $(Y, Z)$ be a random vector taking values $(y, z) \in \text{Image}(Y) \times \text{Image}(Z)$ and let $W$ be a set of positions in a CEG $C$. We will show that it is sufficient to know the probabilities of the events $\{Y = y, Z = z, W\}$. As above let $\hat{P}(Y(W) = y)$ denote the probability of $Y$ after a manipulation to a set of positions $W$.

For every $z \in \text{Image}(Z)$ let $\Omega_z$ be the set of root-to-leaf paths corresponding to the event $\{Z = z\}$. Assume that there exists a regular set of positions $W_z$ such that the set of root-to-leaf paths through $W_z$ is equal to $\Omega_z$. Let $C(W_z)$ be the sub-CEG defined as in Section 4.3 with new root vertex $w_0(z)$.

Theorem 5.1. Consider a random variable $Z$ on a CEG $C$ and a manipulation forced to a set of positions $W$ in $C$.

(i) For $z \in \text{Image}(Z)$ let $W(z)$ be the set of manipulated positions on root-to-leaf paths passing through a position in $W_z$.

(ii) Assume that if in $C(W_z)$ $w' \in W(z)$ and $w \in W_z$ are on the same root-to-leaf path then $w'$ comes after $w$ in the path ordering.

(iii) Assume that for each $z \in \text{Image}(Z)$ the manipulation in $C(W_z)$ is amenable forcing to $W(z)$.

Let $Y(W)$ be a random variable on the manipulated CEG. For $y \in \text{Image}(Y(W))$ and $z \in \text{Image}(Z)$ let $\{Y(W_z) = y\}$ be the set of root-to-leaf paths for the event $\{Y(W) = y\}$ intersected with the set of root-to-leaf paths through $W_z$. As before, let $\tilde{Y}(W)$ be the extension of $Y(W)$ to the unmanipulated CEG. Then

(11) $\hat{P}(Y(W) = y) = \sum_{z \in \text{Image}(Z)} P(\tilde{Y}(W) = y \mid Z = z) \, P(Z = z)$.
Proof. From (i) for each $z \in \text{Image}(Z)$ the conditional events $\{Y = y, W \mid Z = z\} = \{Y = y, W(z) \mid Z = z\}$, $y \in \text{Image}(Y)$, are all measurable with respect to the power set of $C(W_z)$. From (ii) $\hat{P}(Z = z) = P(Z = z)$, and by Lemma 4.3 and condition (iii) we have $\hat{P}(Y(W_z) = y \mid Z = z) = P(\tilde{Y}(W_z) = y \mid Z = z)$. So

$\hat{P}(Y(W_z) = y) = \hat{P}(Y(W_z) = y \mid Z = z) \, \hat{P}(Z = z)$   by the definition of $W(z)$
$= P(\tilde{Y}(W_z) = y \mid Z = z) \, P(Z = z)$   by the two identities above
$= P(\tilde{Y}(W) = y \mid Z = z) \, P(Z = z)$   by the definition of $W(z)$.

□

The topology of each graph $C(W_z)$ is inherited from $C$ except for the new root vertex $w_0(z)$ and its connecting edges. So the topological conditions on $C(W_z)$ given in Theorem 5.1 are inherited as topological conditions on the idle CEG $C$. Theorem 5.1 is particularly useful when the topologies of $C(W_z)$, $z \in \text{Image}(Z)$, are different, evoking different ways of satisfying the criteria of Theorem 5.1 for different configurations $z$. Of course, if it is possible to express a model using a BN, then $C(W_z)$, $z \in \text{Image}(Z)$, are all identical.

Example 5.1 (Continuation of Example 3.5). First year students making the university their first choice $\{Z = 0\}$ will be allocated a shared apartment on campus whilst others $\{Z = 1\}$ will be lodged either in town W, namely $\{X_3 = 0\}$, or in town C, namely $\{X_3 = 1\}$, and have a friendly landlord $\{U = 0\}$ or an unfriendly landlord $\{U = 1\}$. When $\{Z = 0\}$ it is believed that the CEG of Figure [17] is valid. If $\{Z = 1\}$ the town is chosen independently of the race $X_2$ of the first year student, and the friendliness of the landlord does not depend on the town or race of the student. However the satisfaction $Y$ depends both on the friendliness of the landlord and on the allocated town, C having a higher probability of higher satisfaction $\{Y = 1\}$ than W, conditional on $\{Z = 1\}$. This scenario can be expressed as a BN on four random variables in (12):

(12) $X_2 \qquad U \rightarrow Y \leftarrow X_3$

We want to consider a proposed manipulation of our allocation policy for next year. We plan to match campus students so that those sharing an apartment are of the same race and to allocate off campus students only to lodgings in town C. Our interest is in $\hat{P}(Y(W) = 1)$, i.e. the overall predicted probability of high satisfaction were we to implement this policy. We plan to estimate this probability with a small data set, collected from earlier years. The sort of asymmetries exhibited by this problem makes it extremely awkward to represent it through a single BN. Thus $X_1$ is only defined for a student allocated to campus whilst $(X_3, U)$ is only defined for students allocated to lodgings. Furthermore the manipulation proposed is different for different contingencies.

The whole problem can be represented by the CEG in Figure 18, where the set of manipulated
positions is coloured in black. Note that $\{Z = 0\}$ and $\{Z = 1\}$ define a cut and that, for
our proposed manipulation, condition (i) in Theorem 5.1 is satisfied.

The topology of $C_{Z=0}$ is identical to the sub-graph of $C$ consisting of all edges and vertices
on root-to-leaf paths containing the edge $\{Z = 0\}$. Similarly, $C_{Z=1}$ can be identified with the
sub-graph of $C$ consisting of all edges and vertices on root-to-leaf paths containing the edge
$\{Z = 1\}$.
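
The identification of a conditioned graph with a sub-graph of $C$ can be sketched computationally. The following minimal Python sketch is illustrative only: the encoding of a CEG as a list of (tail, head, label) triples, the function names `reachable` and `subgraph_through_edge`, and the toy graph are assumptions made here, and the toy graph is not the CEG of the university example. The sketch assumes an acyclic graph in which every vertex lies on some root-to-sink path, and it keeps exactly the edges that lie on a root-to-sink path through a chosen edge.

```python
from collections import defaultdict


def reachable(adjacency, start):
    """Return the set of vertices reachable from `start` (including `start`)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def subgraph_through_edge(edges, target):
    """Edges and vertices on root-to-sink paths that contain `target`.

    `edges` is a list of (tail, head, label) triples of an acyclic graph
    (hypothetical encoding); `target` is one of those triples.
    """
    forward, backward = defaultdict(list), defaultdict(list)
    for tail, head, _ in edges:
        forward[tail].append(head)
        backward[head].append(tail)
    tail0, head0, _ = target
    upstream = reachable(backward, tail0)    # vertices that can reach the tail
    downstream = reachable(forward, head0)   # vertices reachable from the head
    kept = [e for e in edges
            if e == target or e[1] in upstream or e[0] in downstream]
    vertices = {v for tail, head, _ in kept for v in (tail, head)}
    return kept, vertices


# Toy graph (an assumption for illustration): a cut {Z = 0}, {Z = 1} at the root.
toy_edges = [
    ("w0", "w1", "Z=0"), ("w0", "w2", "Z=1"),
    ("w1", "w_inf", "Y=1"), ("w1", "w_inf", "Y=0"),
    ("w2", "w3", "X3=1"),
    ("w3", "w_inf", "Y=1"), ("w3", "w_inf", "Y=0"),
]
kept, vertices = subgraph_through_edge(toy_edges, ("w0", "w1", "Z=0"))
```

In this toy graph the sub-graph for the edge labelled `Z=0` retains only the root, the position `w1`, the sink and the three edges joining them; the branch through `Z=1` is discarded.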

Since $W(0)$ is a singleton, applying Lemma B~T1 to $C_{Z=0}$ gives
\[ P(Y(W(0)) = 1 \mid Z = 0) = P(Y = 1 \mid X = 0, Z = 0). \]
A manipulation to $W(1)$ on $C_{Z=1}$ is clearly amenable, so that
\[ P(Y(W(1)) = 1 \mid Z = 1) = P(Y = 1 \mid X = 1, Z = 1). \]
Averaging over the cut $\{Z = 0\}$, $\{Z = 1\}$, it follows that
\[ P(Y(W) = 1) = P(Y = 1 \mid X = 0, Z = 0)\, P(Z = 0) + P(Y = 1 \mid X = 1, Z = 1)\, P(Z = 1). \]
Thus $P(Y(W) = 1)$ is expressed as a function of only three probabilities from the idle
system: the probability that a student is on campus, the probability that a campus student
makes a favourable return given that she shares the race of her room-mate, and the
probability that a non-campus student residing in town C makes a favourable return. It
follows that we have been able to deduce from the topology of $C$ that the probability of the
ethnicity of matched pairs of campus students and the conditional distribution of returns of
unmatched students are irrelevant to $P(Y(W) = 1)$. Furthermore, the race of lodged students
and the probability of friendliness of their landlord are also irrelevant to this calculation
and need not be estimated. Note that we have also deduced which function of the variables
is sufficient to discover $P(Y(W) = 1)$: here $X = |X_1 - X_2|/2$, a feature that cannot be
deduced directly from any BN on the original measurement variables.
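
As a numerical illustration of how the three idle-system probabilities combine, a minimal Python sketch follows. The function name `manipulated_satisfaction`, its argument names and the numerical values are hypothetical and chosen only for illustration; they are not estimates from any data set discussed in the paper.

```python
def manipulated_satisfaction(p_campus, p_sat_matched_campus, p_sat_town_c):
    """Combine the three idle-system probabilities of Example 5.1.

    p_campus             : P(Z = 0), probability that a student is on campus
    p_sat_matched_campus : P(Y = 1 | X = 0, Z = 0)
    p_sat_town_c         : P(Y = 1 | X = 1, Z = 1)
    All three inputs are hypothetical placeholders supplied by the user.
    """
    for p in (p_campus, p_sat_matched_campus, p_sat_town_c):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each input must be a probability in [0, 1]")
    # Weighted average over the cut {Z = 0}, {Z = 1}.
    return p_sat_matched_campus * p_campus + p_sat_town_c * (1.0 - p_campus)


# Illustrative values only (assumed, not taken from the paper).
predicted = manipulated_satisfaction(
    p_campus=0.6, p_sat_matched_campus=0.8, p_sat_town_c=0.5
)
```

With these assumed inputs the predicted probability of high satisfaction is approximately 0.68. No quantity concerning landlords or unmatched campus students enters the calculation, in line with the irrelevance statements above.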

Corollary 5.1. Consider a causal CEG, $Z$ and $Y$ as in Theorem 5.1 and a pure
manipulation to a manipulation set $W$. If all the events $\{Y = y, W, Z = z\}$, $y \in \mathrm{Image}(Y)$
and $z \in \mathrm{Image}(Z)$, are manifest, then the effect of the manipulation is identified and given
by Equation whenever W is simple, conditioned on Z . 

Proof. It follows directly from Theorem 5.1. □

Example 5.2. Note that if a CEG of a BN is constructed so that the back-door variables
are introduced as early as possible, compatibly with the ordering of the BN, then the
conditions of Theorem 5.1 are satisfied for atomic interventions on a causal BN.

Example 5.3 (Continuation of Example 3.7). Manipulation 3. Consider a third manipulation,
forced to the set of positions $W = \{w_{14}, w_{16}\}$, i.e. the witness is forced to
positively identify the suspect. The CEG $C(W)$ is given in Figure 4. Suppose we learn the
value of $Z$, which is 1 if a path passes through $w_7$, 2 if through $w_8$ and 3 if through $w_4$,
meaning respectively that the suspect threw the brick, the suspect was present but did not
throw the brick, and the suspect was not present. The conditions of Theorem 5.1 are
satisfied. For example, conditioning on $\{Z = 1\}$ gives the graph in Figure 5. In this
sub-graph $\{w_7\}$ is the only active position and $\{w_0, w_1, w_3, w_{10}, w_{11}\}$ are the
background positions. The same holds for the sub-graphs associated with $\{Z = 2\}$ and
$\{Z = 3\}$. Therefore we can conclude that the probability of conviction $\{X_6 = 1\}$ after
this manipulation is

\[ P(X_6 = 1) = \sum_{z=1}^{3} P(Z = z)\, P(X_6 = 1 \mid Z = z, X_5 = 1). \]

This probability can be estimated from data on similar cases, provided that in these cases
the joint distribution of $\{Z, X_6 \mid X_5\}$ is known and recorded.
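
One way such an estimate might be formed from recorded cases is sketched below in Python. This is a plug-in estimate based on relative frequencies, which is an assumption made here and not a method prescribed in the paper; the function name `conviction_probability`, the record layout `(z, x5, x6)` and the toy records are likewise hypothetical.

```python
from collections import Counter


def conviction_probability(records):
    """Plug-in estimate of sum_z P(Z = z) P(X6 = 1 | Z = z, X5 = 1).

    `records` is a list of (z, x5, x6) triples from similar cases
    (hypothetical layout). P(Z = z) is estimated from all records and the
    conditional probability from the records with x5 == 1 in stratum z.
    """
    if not records:
        raise ValueError("no records supplied")
    n = len(records)
    z_counts = Counter(z for z, _, _ in records)
    total = 0.0
    for z, n_z in z_counts.items():
        outcomes = [x6 for zz, x5, x6 in records if zz == z and x5 == 1]
        if not outcomes:
            # Without cases having X5 = 1 in this stratum the term cannot be estimated.
            raise ValueError(f"no records with X5 = 1 for Z = {z}")
        total += (n_z / n) * (sum(outcomes) / len(outcomes))
    return total


# Toy records (assumed for illustration only): (z, x5, x6).
toy_records = [
    (1, 1, 1), (1, 1, 1), (1, 0, 0), (1, 1, 0),
    (2, 1, 1), (2, 1, 0), (2, 0, 0),
    (3, 1, 0), (3, 1, 0), (3, 0, 0),
]
estimate = conviction_probability(toy_records)
```

The explicit error for an empty stratum reflects the proviso in the text: the estimate is available only when cases with $X_5 = 1$ have been recorded for every value of $Z$.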

6. Discussion. Some problems, which are not satisfactorily expressed in terms of the
exchangeable relationships in a BN (see [22], [23] and Example 3.5), can be well represented
by a CEG model. This applies when the sample space is asymmetric, as in the CEG in
Figure 18, or when the order of factors may be different for different settings of the factors,
as in Example 2.3.

Sometimes, as in Example 3.5, causality is naturally expressed through predictions
concerning the manipulation of unfolding situations rather than through assertions about
the effects of manipulations on dependence relationships between measurements. Example 5.1
shows that, to determine the effect of a cause, a function of the measurements, namely
$|X_1 - X_2|/2$, may be sufficient, so that we need not resort only to subsets of the
measurements as in BN's. BN technology encourages us to express causal hypotheses in terms
of the random variables and the parametrisation in which the data are conveyed to us. It is
now well appreciated that it is often necessary to separate causal structure from the
dependence structures introduced into measurements through a particular sampling
mechanism specific to the acquisition of information for a particular study. CEG
modelling allows that.

imsart-aos ver. 2005/10/19 file: currentRevAnnals.tex date: February 2, 2008 




Note that the high number of vertices in a probability tree is reduced in a CEG by the
sink node that collects the leaves and by the modelling constraints imposed by the position
equivalence relation. However, in some cases the number of vertices can remain much larger
than the number of factors in the modelled problem. For this reason we do not recommend the
use of a CEG model where a simpler structure, like a BN, can be usefully employed.
Furthermore, the BN gives a representation of dependence structure that does not depend on
the sample space of the random variables, whilst the CEG needs this sample space to be
specified and to be finite. Sometimes when addressing causal ideas we do not want the nature
or size of the sample space to intrude, in which case we are again forced back on to a BN
representation.

Another difficulty with CEG's is that, to our knowledge, the characterization of the
equivalence classes of CEG's whose probabilistic structure is identical, and of those
associated with graphs which can causally be identified, is as yet only partially
understood. This is in contrast to BN's, where such equivalence classes can be determined,
respectively, by the pattern (or essential graph) or by the BN itself. For tasks like
model selection it is important to develop a good understanding of such equivalence classes
for CEG's. We plan to report on this topic in a future paper. Finally, whilst much more
varied dependence relationships can be expressed using a CEG than using a BN, the CEG,
being a graph, is also limited in the number of factorization formulae it can express
simultaneously. When such structure is present the CEG no longer provides a fully
topological framework expressing all the contingent independence relationships, and our
criticisms of BN's apply equally to CEG's. We are then thrown on to causal analyses using
more algebraic techniques [24]. Despite these caveats we believe that CEG's provide a
powerful graphical tool for the study of implicit causal relationships derivable from their
qualitative structure.

We now turn to briefly discuss some generalisations. In the paper we considered the power
set $\sigma$-algebra to limit technicalities, but generalisations of our results can be
considered for less refined $\sigma$-algebras. Searching over functions of measurements to
find the cheapest way of identifying a quantity of interest will often be of great value.
This will be particularly useful if those measurements have not yet been collected, or if
their parametrisations have been chosen by convention rather than because they reflect in
some natural way the mechanism by which things happen. Search algorithms need to be
developed. Example 2.1 shows that techniques from algebraic geometry can be usefully
employed on CEG's, as they have been on BN's [h| and for identification of causal effects
on BN's [22]. This is only in part being explored and could be combined with search
algorithms and design issues.
Fig 1. CEG for Example 3.7
Fig 2. CEG for Manipulation 1 in Example 3.7
Fig 3. CEG for Manipulation 2 in Example 3.7
Fig 5. Backdoor theorem for Example 5.3
Fig 6. Primitive probabilities and atomic event probabilities
Fig 7. Stages and independence.
Fig 9. CEG for Figure
Fig 11. CEG for the event tree in Figure \W\
Fig 12. CEG for the event tree in Figure^
Fig 13. CEG for binary $X_2 \leftarrow X_1 \rightarrow X_3$
Fig 14. CEG for condition (5) but not necessarily (6)
Fig 15. Manipulated CEG for Example X3J&
Fig 18. Modified university example